* Do we only want to download openly licensed information? ANS: No, we download everything (though we do need to think about this with regard to IP issues).
* Should we have clever ways to dump APIs? ANS: No.
* Do we respect robots.txt even for openly licensed information? ANS: No (we're not crawling, we're archiving).
* Use HTTP/1.1 caching headers? ANS: Yes; if the resource has not changed since we last updated, don't bother to re-cache it (see the sketch after this list).
  * Complete support for ETags
  * Expires, Max-Age, etc.
  * Check
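
A minimal sketch of the caching-header check, assuming Python with the `requests` library (not necessarily what the archiver itself uses); `last_etag` and `last_modified` are hypothetical names for the validator values saved from the previous archive run:

```python
import requests

def fetch_if_changed(url, last_etag=None, last_modified=None, timeout=30):
    """Conditional GET: a 304 Not Modified answer means nothing changed
    since we last cached the resource, so we can skip re-caching it."""
    headers = {}
    if last_etag:
        headers["If-None-Match"] = last_etag          # ETag support
    if last_modified:
        headers["If-Modified-Since"] = last_modified  # Last-Modified support

    response = requests.get(url, headers=headers, stream=True, timeout=timeout)
    if response.status_code == 304:
        return False, response   # unchanged: don't bother to re-cache
    return True, response        # changed (or no validators stored): re-archive
```

Expires / Cache-Control max-age from the previously stored response could also be checked before making any request at all, skipping the round trip entirely while the cached copy is still fresh.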
Process:

1. [Archiver.Update checks the queue (automated as part of celery)]
2. Open the URL and get any info from the resource on cache / content-length etc.
   1. If FAILURE status: update the task_status table (could retry if there have been no more than 3 failures so far). Report the task failure in celery.
   2. Check the headers for content-length and content-type (a pre-check sketch follows this list):
      * IF content-length > max_content_length: EXIT (store outcomes on task_status, and update the resource with size, content-type and any other info we get?)
      * ELSE: check content-type.
        * IF not a data format (e.g. text/html): EXIT (store outcomes and info on the resource).
        * ELSE: archive it (compute the md5 hash etc.).
      * IF we get a hash from the headers and the hash is unchanged: GOTO step 4.
      * IF we get a content-length and it is unchanged: GOTO step 4.
      * IF max-age / expires / other cache headers show it has not changed since the last check: GOTO step 4.
3. Archive it: connect to the storage system and store it (a storage sketch also follows this list). Bucket: from config, Key: /{timestamp}/{resourceid}/filename.ext
   * Add the cache URL and updated date to the resource.
   * Add other relevant info to the resource, such as md5, content-type etc.
4. Update task_status.
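
A sketch of the step-2 header pre-checks, again assuming `requests`; the `MAX_CONTENT_LENGTH` value and the allow-list of data content types are illustrative, not the project's actual configuration:

```python
import requests

MAX_CONTENT_LENGTH = 50 * 1024 * 1024                 # hypothetical size cap
DATA_CONTENT_TYPES = {"text/csv", "application/json",
                      "application/xml", "application/vnd.ms-excel"}  # hypothetical allow-list

def precheck(url, timeout=30):
    """Return (ok, info): ok is False when we should EXIT without archiving."""
    response = requests.head(url, allow_redirects=True, timeout=timeout)
    response.raise_for_status()                        # FAILURE status handled by the caller

    info = {
        "size": response.headers.get("Content-Length"),
        "content_type": response.headers.get("Content-Type", "").split(";")[0].strip(),
    }

    if info["size"] and int(info["size"]) > MAX_CONTENT_LENGTH:
        return False, info                             # too large: store outcome on task_status
    if info["content_type"] and info["content_type"] not in DATA_CONTENT_TYPES:
        return False, info                             # not data stuff (e.g. text/html)
    return True, info                                  # go on to archive it
```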
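
And a sketch of step 3, storing the downloaded file under /{timestamp}/{resourceid}/filename.ext. It assumes `requests` plus `boto3` as the storage client; the bucket name, cache URL shape and returned field names are hypothetical:

```python
import hashlib
from datetime import datetime, timezone
from urllib.parse import urlsplit

import boto3
import requests

def archive(url, resource_id, bucket, timeout=30):
    """Download the resource, store it, and return the info to write back onto the resource."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    body = response.content

    filename = urlsplit(url).path.rsplit("/", 1)[-1] or "data"
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    key = f"{timestamp}/{resource_id}/{filename}"      # Key: /{timestamp}/{resourceid}/filename.ext

    s3 = boto3.client("s3")                            # bucket would come from config in practice
    s3.put_object(Bucket=bucket, Key=key, Body=body)

    return {
        "cache_url": f"https://{bucket}.s3.amazonaws.com/{key}",
        "cache_last_updated": timestamp,
        "hash": hashlib.md5(body).hexdigest(),         # md5 etc.
        "content_type": response.headers.get("Content-Type"),
        "size": len(body),
    }
```

Either way the flow ends, the caller would then update task_status (step 4) with the outcome.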