https://commoncrawl.org/errata/extra-line-in-response-records-between-headers-and-payload
Common Crawl - Erratum - Redundant extra line in response records
The WARC files of the August 2018 crawl contain a redundant empty line between the HTTP headers and the payload of WARC response records.
https://commoncrawl.org/errata/truncated-wat-files
Common Crawl - Erratum - Truncated WAT Files
Four WAT files of the March 2017 crawl (CC-MAIN-2017-13) are truncated, potentially causing an error when processing them.
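A truncated WAT file can be detected before processing by reading it to the end and watching for a decompression error. The following is a minimal sketch, assuming the WAT files are gzip-compressed as distributed; the function name `is_gzip_truncated` and the chunk size are illustrative and not part of any Common Crawl tooling.

```python
import gzip
import zlib


def is_gzip_truncated(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` ends before its end-of-stream marker.

    Hypothetical helper for illustration only.
    """
    try:
        with gzip.open(path, "rb") as stream:
            # Read through the whole file; the gzip module handles
            # multi-member files transparently.
            while stream.read(chunk_size):
                pass
    except (EOFError, zlib.error, gzip.BadGzipFile):
        # Raised when the compressed data stops early or is corrupt.
        return True
    return False
```

A file that happens to be cut exactly at the boundary between two gzip members would still pass this check, so it is a first filter rather than a complete validation.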
https://commoncrawl.org/errata/missing-content-truncated-flag-in-url-indexes
Common Crawl - Erratum - Missing content_truncated flag in URL indexes
The flag in our URL indexes (CDX and columnar) that indicates whether or not a WARC record payload was truncated was added in CC-MAIN-2019-47. This indicator...
https://commoncrawl.org/errata/redirect-target-url-in-url-indexes-may-be-a-relative-url
Common Crawl - Erratum - Redirect target URL in URL indexes may be a relative URL
When the HTTP “Location” header includes a relative URL, the corresponding “redirect” field in the CDX index and “fetch_redirect” field in the columnar index...
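Where a redirect field holds a relative URL, it can be resolved against the URL of the record that issued the redirect. The following is a minimal sketch using Python's standard `urllib.parse.urljoin`; the function name `resolve_redirect` and the example URLs are assumptions for illustration, and only the field names `redirect` and `fetch_redirect` come from the erratum.

```python
from urllib.parse import urljoin


def resolve_redirect(record_url, redirect_value):
    """Resolve an index redirect value against the URL of the fetched record.

    Hypothetical helper: `record_url` is the URL of the index entry and
    `redirect_value` is the content of its `redirect` (CDX) or
    `fetch_redirect` (columnar) field. Absolute values pass through unchanged.
    """
    return urljoin(record_url, redirect_value)


# Illustrative values, not taken from any real index entry.
absolute = resolve_redirect("https://example.com/a/b.html", "/login")
```

Because `urljoin` leaves absolute URLs untouched, the same call can be applied to every entry without first checking whether the value is relative.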
https://commoncrawl.org/errata/charset-detection-bug-in-wet-records
Common Crawl - Erratum - Charset Detection Bug in WET Records
The charset detection required to properly transform non-UTF-8 HTML pages in WARC files into WET records didn't work before November 2016 (CC-MAIN-2016-50) due...