User:Δεριζαματζορπρομπλεμιναυστραλια/PrototypingnetworkedmediaSniff Scrape Crawl: Difference between revisions
Revision as of 21:10, 29 June 2014
archive.org
Each item includes multiple files. Metadata is stored as Dublin Core, or as fuller MARCXML records that use the Library of Congress (L.o.C.) MARC 21 format for bibliographic data.
For information about every file in an item's directory, view the file ending in _files.xml; for all of the metadata for the item, view the file ending in _meta.xml.
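As a rough illustration of reading a _meta.xml file, the sketch below parses a small sample with Python's standard library. The sample document, its collection values, and the helper name parse_meta_xml are hypothetical; the assumption is that the file is a flat list of field elements under a single root, which should be checked against a real item.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, assumed to be shaped like an <identifier>_meta.xml file.
SAMPLE_META_XML = """<metadata>
  <identifier>bplill</identifier>
  <title>Example title</title>
  <collection>example_collection_a</collection>
  <collection>example_collection_b</collection>
</metadata>"""


def parse_meta_xml(xml_text):
    """Return a dict mapping each metadata field to a list of its values."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        # A field such as <collection> may repeat, so values are kept in lists.
        fields.setdefault(child.tag, []).append((child.text or "").strip())
    return fields


meta = parse_meta_xml(SAMPLE_META_XML)
print(meta["identifier"], meta["collection"])
```

Keeping every field as a list avoids silently dropping repeated fields.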
Identifier
The globally unique ID of a given item on archive.org
Collections and items all have a unique identifier
(taken from the Title field of the entry).
item: 251735-autopsy-0001-optimized
collection: opensource_Afrikaans
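Since the identifier is what appears in an item's URL, a small helper can turn identifiers into addresses. This is only a sketch: the /details/ pattern matches the URLs used on this page, while the /download/<identifier>/<identifier>_meta.xml layout is an assumption, and both function names are made up for illustration.

```python
from urllib.parse import quote

BASE = "https://archive.org"


def details_url(identifier):
    """Build the details page URL for an item or collection identifier."""
    return f"{BASE}/details/{quote(identifier)}"


def metadata_file_url(identifier, suffix="_meta.xml"):
    """Build the URL of a metadata file such as _meta.xml or _files.xml."""
    # Assumed layout: /download/<identifier>/<identifier><suffix>
    safe = quote(identifier)
    return f"{BASE}/download/{safe}/{safe}{suffix}"


print(details_url("251735-autopsy-0001-optimized"))
print(details_url("opensource_Afrikaans"))
print(metadata_file_url("bplill", "_files.xml"))
```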
Archive.org and python
Found this code (Python 2) to open an item from the command line:
# open-webpage.py
import urllib2
url = 'https://archive.org/details/dasleidenunsersh00bras'
response = urllib2.urlopen(url)
webContent = response.read()
print webContent[0:500]
And then this to save the HTML output to a file:
# save-webpage.py
import urllib2
url = 'https://archive.org/details/bplill'
response = urllib2.urlopen(url)
webContent = response.read()
f = open('bplill.html', 'w')
f.write(webContent)
f.close()  # the parentheses are needed to actually close the file
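The two scripts above are Python 2 (urllib2, print statement). A possible Python 3 version of the same idea is sketched below, using urllib.request from the standard library. The function names fetch and save are my own, not from the original snippets.

```python
import urllib.request


def fetch(url, limit=None):
    """Return the raw bytes at url, optionally truncated to limit bytes."""
    with urllib.request.urlopen(url) as response:
        content = response.read()
    return content if limit is None else content[:limit]


def save(url, path):
    """Download url, write the bytes to path, and return the byte count."""
    content = fetch(url)
    # Binary mode, because read() returns bytes in Python 3.
    with open(path, "wb") as f:  # the with-block closes the file
        return f.write(content)
```

Usage would look like print(fetch('https://archive.org/details/dasleidenunsersh00bras', limit=500)) or save('https://archive.org/details/bplill', 'bplill.html'). The with-blocks take care of closing the response and the file.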
internetarchive 0.6.6 = a Python interface to archive.org
1) ia, for using the archive from the command line
2) internetarchive, for programmatic access to the archive
Accessing an IA Collection in Python
To see the number of items in a collection:
import internetarchive
search = internetarchive.Search('collection:nasa')
print search.num_found
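For comparison, the same count could probably be obtained without the internetarchive package. The sketch below does not use that library; it assumes archive.org's advancedsearch.php endpoint and a Solr-style JSON reply with a response.numFound field. The helper names and the sample reply are hypothetical, and the sample number is not real data.

```python
import json
from urllib.parse import urlencode

# Assumed search endpoint; not taken from the notes above.
SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def search_url(query):
    """Build a search URL that asks only for the number of matches."""
    # rows=0 requests the count without listing any items.
    params = {"q": query, "rows": 0, "output": "json"}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def num_found(json_text):
    """Extract the match count from a JSON reply (assumed shape)."""
    return json.loads(json_text)["response"]["numFound"]


# Hypothetical reply body, for illustration only.
sample = '{"response": {"numFound": 3, "start": 0, "docs": []}}'
print(search_url("collection:nasa"))
print(num_found(sample))
```

Fetching the URL from search_url with urllib and passing the body to num_found would give the live count, provided the assumed endpoint and reply shape hold.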