An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.
An easy-to-use and highly customizable crawler that enables you to create your own small web archives (WARC/CDX).