| 7 | | * Do we only want to download openly licensed information? |
| 8 | | * Should we have clever ways to dump APIs? |
| 9 | | * Do we respect robots.txt even for openly licensed information? |
| | 7 | * Do we only want to download openly licensed information? ANS: no, we do everything (though we do need to think about this re. IP issues) |
| | 8 | * Should we have clever ways to dump APIs? ANS: no. |
| | 9 | * Do we respect robots.txt even for openly licensed information? ANS: No (we're not crawling, we're archiving) |
| | 10 | * Use HTTP/1.1 Caching headers? ANS: if the resource has not changed since we last updated, don't bother to re-cache. |
| | 11 | * Complete support for ETags |
| | 12 | * Expires, Max-Age etc. |
| | 13 | * Check |
| | 20 | Process: |
| | 21 | |
| | 22 | 1. [Archiver.Update checks queue (automated as part of celery)] |
| | 23 | 2. Open the URL and get any info from the resource on caching, content-length, etc. |
| | 24 | 1. If FAILURE status: update task_status table (could retry if not more than 3 failures so far). Report task failure in celery |
| | 25 | 2. Check headers for content-length and content-type ... |
| | 26 | * IF: content-length > max_content_length: EXIT (store outcomes on task_status, and update resource with size and content-type and any other info we get?) |
| | 27 | * ELSE: check content-type. |
| | 28 | * IF: content-type is NOT a data type (e.g. text/html), then EXIT. (store outcomes and info on resource) |
| | 29 | * ELSE: archive it (compute md5 hash etc) |
| | 30 | * IF: the headers supply a hash and the hash is unchanged, GOTO step 4 |
| | 31 | * IF: the headers supply a content-length and the content-length is unchanged, GOTO step 4 |
| | 32 | * IF: max-age / expires / other cache headers show this has not changed since the last check, GOTO step 4 |
| | 33 | 3. Archive it: connect to storage system and store it. Bucket: from config, Key: /{timestamp}/{resourceid}/filename.ext |
| | 34 | * Add the cache URL and the updated date to the resource |
| | 35 | * Add other relevant info to resource such as md5, content-type etc |
| | 36 | 4. Update task_status |
| | 37 | |
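
The header checks in step 2 and the storage key in step 3 can be sketched as two pure functions. This is only an illustration under stated assumptions, not the archiver's actual code: the names `decide`, `storage_key`, `LastSeen`, the size limit, and the outcome strings are all hypothetical. It also assumes the "hash from headers" is the ETag, and that "not more than 3 failures so far" means a retry is allowed while the failure count is at most 3. The max-age / expires check is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical settings; the notes only say these come from config.
MAX_CONTENT_LENGTH = 50 * 1024 * 1024   # assumed limit in bytes
MAX_FAILURES = 3                        # "not more than 3 failures so far"
NON_DATA_TYPES = {"text/html"}          # example given in the notes


@dataclass
class LastSeen:
    """What was recorded on the resource at the previous archive run."""
    etag: Optional[str] = None
    content_length: Optional[int] = None


def decide(status: int, headers: Mapping[str, str],
           last: Optional[LastSeen] = None, failures_so_far: int = 0) -> str:
    """Return the outcome to record in task_status for one resource."""
    # Step 2.1: failure status, retry only while under the failure limit.
    if status >= 400:
        return "retry" if failures_so_far <= MAX_FAILURES else "fail"

    # HTTP header names are case-insensitive, so normalise them first.
    h = {k.lower(): v for k, v in headers.items()}
    raw_length = h.get("content-length", "")
    length = int(raw_length) if raw_length.isdigit() else None

    # Step 2.2: too large, or not a data content-type, means EXIT.
    if length is not None and length > MAX_CONTENT_LENGTH:
        return "skip_too_large"
    ctype = h.get("content-type", "").split(";")[0].strip().lower()
    if ctype in NON_DATA_TYPES:
        return "skip_not_data"

    # Unchanged hash (assumed: ETag) or content-length means GOTO step 4.
    if last is not None:
        etag = h.get("etag")
        if etag is not None and etag == last.etag:
            return "unchanged"
        if length is not None and length == last.content_length:
            return "unchanged"

    # Otherwise fall through to step 3.
    return "archive"


def storage_key(timestamp: str, resource_id: str, filename: str) -> str:
    """Build the key /{timestamp}/{resourceid}/filename.ext from step 3."""
    return f"/{timestamp}/{resource_id}/{filename}"
```

Keeping the decision separate from the network and storage calls means the branching in step 2 can be checked without opening any URL; the celery task would then only need to fetch the headers, call `decide`, and act on the returned outcome.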