Job Opportunities at the Internet Archive

    Web Crawl Engineer

    Web Archiving Software Engineer

    Head of Digitization

    Book Scanner

    Senior Application Developer: Archive.org

    Senior Engineer: Wayback Machine

    About the Internet Archive

    The Internet Archive is a non-profit with a huge mission: to give everyone access to all knowledge, the books, web pages, audio, television and software of our shared human culture. Forever. Based in San Francisco and with satellites around the world, Internet Archive staffers are building the digital library of the future, a place where anyone can go to learn and explore. Our 160 engineers, book scanners, librarians, designers and team members have built the 250+ most popular website in the world (https://archive.org). The Internet Archive is a non-profit digital library offering free universal access to books, movies and music, as well as 464 billion archived web pages.

Web Crawl Engineer

Location: Inner Richmond, San Francisco, CA or Remote

Job Classification: Full-time, Exempt

Job Summary: The Internet Archive is seeking a Web Crawl Engineer for its Web Archiving Group. Our crawl engineering team is responsible for capturing and managing the highest quality content from the web. An ideal candidate demonstrates independence and initiative, is a problem solver, works well autonomously, and is technologically savvy. Additionally, the ideal candidate is open to being trained on, and helping advance, best practices and standards around large-scale web harvests, web data processing and engineering, and contributing to the development of new harvesting, access, and analysis tools.

The position will work in the Web Archiving Group in support of web harvesting services and programs working with partners ranging from national libraries and archives to collaborative international initiatives supporting the collection, preservation, and accessibility of web content. The role will help design the strategy and implementation of web archiving services using open source technologies and platforms, develop harvest techniques and tools to enable archival capture and re-rendering of rich media, streaming content, social media, as well as traditional web page content. The position will also create tools, services, and workflows to improve crawl analysis, reports, data management and derivation, and identify technical, operational and data analysis requirements. This role contributes to defining deployment architectures and workflows, managing data at scale, and monitoring production systems.

Essential Job Functions:

  • Running large-scale web harvests on global and national domain levels and focused and specialized crawls using Heritrix, our open-source crawler, as well as other open-source technologies developed internally, including Umbra, Brozzler, warcprox and others.
  • Configuration, monitoring, and improvement of large-scale, multi-machine web crawls to ensure their quality and timely completion.
  • Processing, analysis and quality assurance of archived web content to ensure it is complete and of the highest quality.
  • Contribute to development of tools for automated analysis and reporting of crawl material, and to development projects focused on crawling, processing, and access.
  • Manage both large ingests and exports of web data, derivatives, logs, and reports.
  • Demonstrated experience delivering on commitments with deadlines and project timelines, and working in a collaborative team of engineers and project/product managers.

Minimum Qualifications:

  • Experience with web crawlers or scrapers, especially Heritrix
  • Proven experience in Unix shell scripting and Python coding required
  • Solid experience in Internet protocols (HTTP is a must). Strong knowledge of HTML, JavaScript and Web technologies in general
  • Knowledge of building and deploying web applications, databases, web-host services, and knowledge of basic Linux system administration
  • Ability to work in, and enjoy, a loosely structured work environment

Preferred Qualifications:

  • Cluster computing experience is preferred, especially familiarity with Hadoop and related technologies and tools
  • Experience or familiarity with Java strongly preferred
  • Experience with applications designed to display archived web content, especially server-side apps and Wayback
  • Experience with development environments and system monitoring/administration tools
  • Experience with open source practices, version control, and code review
  • Experience with Atlassian tool sets
  • Flexibility and a sense of humor are a plus

Requirements: Bachelor's Degree in Computer Science or a related field, and five years of progressively responsible experience in software development.

Reporting Structure: The Web Crawl Engineer reports to the Director of Web Archiving and works closely with other departments. The position works alongside other web archiving engineers as well as program staff in the Web Archiving Group and with the broader Internet Archive infrastructure and engineering teams.

To Apply: Please send your resume and cover letter to jobs+crawlengineer@archive.org with the subject line "Web Crawl Engineer."

Internet Archive reserves the right to revise job descriptions or work hours as required.

Internet Archive is an Equal Opportunity Employer and a 501(c)(3) non-profit library founded in 1996.

The Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.

Web Archiving Software Engineer

Location: Inner Richmond, San Francisco, CA or Remote

Job Classification: Full-time, Exempt

Job Summary: The Internet Archive has over 24PB of unique digital information, all running across an integrated cluster of over 700 VMs on 500+ bare-metal hosts in 3 data centers. We are looking for a smart engineer with experience in defining and building service APIs. The ideal candidate will also have experience creating software that interacts with systems at high transaction rates while delivering reliability and performance of both internal and public-facing web applications. All candidates must be able to work collaboratively within our Web Archiving team of talented engineers and program staff.

Essential Job Functions:

  • Build, test, and package APIs for the transfer of data out of a repository of web archive files
  • Consume external APIs to enable the ingest of external data into web archive files
  • Deploy, administer, and tune tools that support the software development infrastructure and data management and processing environments used within the Web Archiving group
  • Analyze, manage, transfer, and maintain large amounts of archival data in multiple environments
  • Participate in monitoring, maintaining, and restoring the health of the storage and computer cluster and key processes and services related to crawling, indexing, and access to archived web content

Minimum Qualifications:

  • Fluency in Linux environments, scripting and/or programming skills, development of custom tool integrations
  • Proven experience in Unix shell scripting and Python required
  • Demonstrated experience building or working with APIs
  • Experience deploying and administering database, search, and web-host services
  • Proven experience with open source practices, participation in open source forums, and staying current with industry trends
  • BS in Computer Science, or equivalent work experience

Preferred Qualifications:

  • Familiarity with configuration of software development environments and cluster administration tools, including Git, the ELK stack, and monitoring tools such as Nagios, Graphite, and Grafana
  • Knowledge of evolving database or analytics tools, especially Hadoop, Druid, or RethinkDB
  • Experience or familiarity with Java is a plus
  • Experience with Atlassian tool sets
  • MS in Computer Science or equivalent work experience
  • Flexibility and a sense of humor

Reporting Structure: The Web Archiving Software Engineer reports to the Director of Engineering and works closely with the Director, Web Archiving Programs. The position will also work alongside other systems, applications, and QA engineers as well as program staff in the Web Archiving Programs team.

To Apply: Please send your resume and cover letter to jobs+webarchivingengineer@archive.org with the subject line "Web Archiving Software Engineer."

Internet Archive reserves the right to revise job descriptions or work hours as required.

Internet Archive is an Equal Opportunity Employer and a 501(c)(3) non-profit library founded in 1996.

The Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.

Head of Digitization

Location: San Francisco, CA, remote possible.

Job Classification: Full-time, Exempt

Senior management position to manage and expand our digitization of millions of books, audio records, films and videotapes to build one of the world's largest digital libraries.

Reporting to the Digital Librarian, the Head of Digitization will have overall strategic and operational responsibility for Internet Archive's 70+ digitization staff in 8 countries, programs, expansion, and execution of the group's mission.

This requires managing people, setting up facilities, creating production processes, and working through process improvements.

Responsibilities

  • Triple production rates and expand media types efficiently digitized
  • Build production processes and manage to them
  • Manage 70+ staff and volunteers working in libraries in multiple countries
  • Manage the contracts and operate a couple of remote "super scanning centers"
  • Work closely with our partner libraries and vendors
  • Identify and resolve bottlenecks within the workflow that hinder quality, partner satisfaction, and efficiency.
  • Develop new and interesting partnerships
  • Track and communicate production throughput and productivity.
  • Project manage hardware/software releases across all scanning operations
  • Develop strong working relationships with engineering, finance, administration, and HR teams

Qualifications:

  • Engineering mindset and approach to production processes
  • Worked internationally in setting up and operating factories
  • Track record of effectively leading and scaling a performance-based organization and staff
  • Unrelenting commitment to quality and efficiency
  • Desire to travel
  • Ability to work effectively with diverse groups of employees and library partners
  • Passion, integrity, positive attitude, mission-driven, and self-directed
  • Engineering degree with at least 5 years of senior management experience.

To Apply: Please send your resume and cover letter to hr@archive.org with the subject line "Head of Digitization."

    Internet Archive reserves the right to revise job descriptions or work hours as required.

    Internet Archive is an Equal Opportunity Employer and a 501(c)(3) non-profit library founded in 1996.

    The Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.

    Book Scanner

    Location: Washington DC area

    Job Classification: Full-time, Non-Exempt

    The Book Scanning Operator ("Scanner") digitizes and helps debug the scanning process in the Internet Archive scanning centers. The Internet Archive has an immediate opening for a Scanner in the Washington DC area.

    Desired Qualifications:

  • High tolerance for repetitive tasks.
  • Attention to detail.
  • Ability to assess image quality and notice if a page has been skipped.
  • Average computer skills.
  • Willingness to do first level of troubleshooting.
  • Ability to communicate with others about problems or solutions.
  • Must be able to sit/stand at a scanning device constantly.
  • Patience and a natural curiosity about how things work are required.
  • Previous imaging experience is not necessary.

    This is a non-exempt hourly position. Benefits include medical, dental, FSA/DCA, 403(b), LTD, and life insurance.

    To Apply: Please send your resume and cover letter to hr@archive.org with the subject line "Scanner DC."

    Internet Archive reserves the right to revise job descriptions or work hours as required.

    Internet Archive is an Equal Opportunity Employer and a 501(c)(3) non-profit library founded in 1996.

    The Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.

    Senior Application Developer: Archive.org

    Location: San Francisco, CA

    Job Classification: Full-time, exempt

    Job Summary: The Internet Archive has a huge corpus of digital information. Every day, our team of development engineers creates tools and applications that help our users access and work with 22 petabytes of content that includes millions of books and texts, millions of hours of video, millions of audio tracks, and over 450 billion web captures. We are looking for smart engineers to help develop the next generation of web-based applications and tools that will be used by libraries and archives around the world to build and manage curated collections of books, texts, web, and image content. The ideal candidate will be a strong programmer who has successfully led and completed several projects involving large or intricate web applications or services, and who works collaboratively with talented engineering colleagues.

    Key Responsibilities:
    This position is part of the team that maintains and evolves the Archive.org web site. More specifically, this means:
    • Work at the direction of the technical project lead to continue to evolve and enhance the next generation of the archive.org web site.

    Minimum Qualifications:

    • Passion for delivering delightful end-user experiences when interacting with delivered web applications and services.
    • Extensive work experience with JavaScript, HTML5, and CSS.
    • Extensive experience developing applications and websites in PHP
    • Work history that includes integrating front-end user interfaces with search, database, and business logic to create integrated applications and services.
    • Experience working with digital media files and metadata structures
    • Experience developing and maintaining structured APIs
    • Good understanding of latest web framework technologies and protocols
    • Fluency in Linux environments
    • Flexibility and a sense of humor

    Preferred Qualifications:
    • Strong programming experience in Python.
    • Experience with open source practices and participation in open source forums
    • Experience working with time-based digital media (audio and video).
    • Specific experience with Atlassian tool sets (Jira, Confluence)

    Reporting Structure: The Web Application Developer reports to the Director of Engineering and will work closely with the web archiving and TV archiving teams. The entire staff is guided by founder and Digital Librarian, Brewster Kahle.

    To Apply: Please send your resume and cover letter to Jobs+Seniorapplicationdeveloper@archive.org with the subject line "AE-106: Web Application Developer."

    Internet Archive reserves the right to revise job descriptions or work hours as required.

    Internet Archive is an Equal Opportunity Employer and a 501(c)(3) non-profit library founded in 1996.

    The Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.

    Senior Engineer: Wayback Machine

    Location: San Francisco, CA

    Job Classification: Full-time, exempt

    Job Summary: The Internet Archive's Wayback Machine is the world's largest public archive of historical web sites. Have you ever wanted to work with 450 billion things at once? Would you like to serve 1,500 requests per second? How about having your service referred to regularly in news articles and blog posts across the web? You can work on a challenging and popular project and help the world at the same time.

    We are looking for a smart, collaborative and resourceful engineer to help develop the next version of the Wayback Machine. The ideal candidate will want to work collaboratively with a small internal team and a large, vocal and active user community, and will demonstrate independence, creativity, initiative and technological savvy, in addition to being a great programmer/architect.

    Minimum Qualifications:

    • 2-3 years of work experience in Python, or similar
    • Experience working in Linux environments
    • Familiarity with Java (current deployment is written in Java)
    • Good understanding of latest web framework technologies and aspects of web technology and protocols
    • Flexibility and a sense of humor
    • BS in Computer Science, or equivalent work experience

    Preferred Qualifications:

    • Experience with web crawlers and/or applications designed to display archived web content (especially server-side apps)
    • Cluster computing experience
    • Open source practices experience

    To Apply: Please send your resume and cover letter to Jobs+SeniorWaybackEngineer@archive.org with the subject line "Wayback Machine Senior Engineer."

    Internet Archive reserves the right to revise job descriptions or work hours as required.

    Internet Archive is an Equal Opportunity Employer and a 501(c)(3) non-profit library founded in 1996.

    The Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.