Robuta

https://commoncrawl.org/errata/extra-line-in-response-records-between-headers-and-payload
Common Crawl - Erratum - Redundant extra line in response records
The WARC files of the August 2018 crawl contain a redundant empty line between the HTTP headers and the payload of WARC response records.

https://commoncrawl.org/errata/truncated-wat-files
Common Crawl - Erratum - Truncated WAT Files
Four WAT files of the March 2017 crawl (CC-MAIN-2017-13) are truncated, potentially causing an error when processing them.

https://commoncrawl.org/errata/missing-content-truncated-flag-in-url-indexes
Common Crawl - Erratum - Missing content_truncated flag in URL indexes
The flag in our URL indexes (CDX and columnar) that indicates whether or not a WARC record payload was truncated was added in CC-MAIN-2019-47. This indicator...

https://commoncrawl.org/errata/redirect-target-url-in-url-indexes-may-be-a-relative-url
Common Crawl - Erratum - Redirect target URL in URL indexes may be a relative URL
When the HTTP “Location” header includes a relative URL, the corresponding “redirect” field in the CDX index and “fetch_redirect” field in the columnar index...

https://commoncrawl.org/errata/charset-detection-bug-in-wet-records
Common Crawl - Erratum - Charset Detection Bug in WET Records
The charset detection required to properly transform non-UTF-8 HTML pages in WARC files into WET records didn't work before November 2016 (CC-MAIN-2016-50) due...
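
The relative-redirect erratum listed above is cut off in this listing, so what it recommends is not known here. As a minimal sketch of one way a consumer of the index might cope, the value of the "redirect" (CDX) or "fetch_redirect" (columnar) field can be resolved against the URL of the record it belongs to. The function name and the sample URLs below are hypothetical and are not taken from Common Crawl's documentation or data.

```python
from urllib.parse import urljoin


def absolute_redirect(record_url: str, redirect_value: str) -> str:
    """Resolve a possibly relative redirect target against the record's own URL.

    Assumption: record_url is the absolute URL of the captured page and
    redirect_value is the raw string stored in the index's redirect field.
    """
    # urljoin returns redirect_value unchanged when it is already absolute,
    # and otherwise resolves it relative to record_url.
    return urljoin(record_url, redirect_value)


# Hypothetical values for illustration only.
print(absolute_redirect("https://example.com/a/b.html", "../c.html"))
print(absolute_redirect("https://example.com/a/b.html", "/login"))
print(absolute_redirect("https://example.com/a/b.html", "https://example.org/x"))
```

Values that are already absolute pass through untouched, so the same call can be applied to every row without first checking whether it is affected.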