User:Δεριζαματζορπρομπλεμιναυστραλια/PrototypingnetworkedmediaSniff Scrape Crawl

From XPUB & Lens-Based wiki

Revision as of 21:14, 29 June 2014

archive.org

Each item has a directory listing showing all original (uploaded by the user), derived (generated by archive.org), and metadata files.

Items include multiple files, including Dublin Core records or fuller MARCXML records that use the Library of Congress (L.o.C.) MARC 21 format for bibliographic data.

For information about every file in this directory, view the file ending in _files.xml; for all of the metadata for the item, view the file ending in _meta.xml.
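
A minimal sketch of reading those two files with the Python 3 standard library. The XML snippets are hypothetical, shortened samples (real files carry many more fields), and the helper names list_files and read_metadata are made up for illustration; they are not part of any archive.org tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample of an <identifier>_files.xml document.
SAMPLE_FILES_XML = """<files>
  <file name="example.mp4" source="original">
    <format>MPEG4</format>
  </file>
  <file name="example.ogv" source="derivative">
    <format>Ogg Video</format>
  </file>
  <file name="example_meta.xml" source="metadata">
    <format>Metadata</format>
  </file>
</files>"""

# Hypothetical, shortened sample of an <identifier>_meta.xml document.
SAMPLE_META_XML = """<metadata>
  <identifier>251735-autopsy-0001-optimized</identifier>
  <title>Example title</title>
  <collection>opensource_Afrikaans</collection>
</metadata>"""


def list_files(files_xml):
    """Return a (name, source, format) tuple for every <file> element."""
    root = ET.fromstring(files_xml)
    rows = []
    for f in root.findall("file"):
        rows.append((f.get("name"), f.get("source"), f.findtext("format")))
    return rows


def read_metadata(meta_xml):
    """Return the metadata fields as a dict of tag -> list of values."""
    root = ET.fromstring(meta_xml)
    fields = {}
    for child in root:
        # A tag such as <collection> may repeat, so keep every value.
        fields.setdefault(child.tag, []).append(child.text)
    return fields


files = list_files(SAMPLE_FILES_XML)
meta = read_metadata(SAMPLE_META_XML)
# Keep only the files uploaded by the user.
originals = [name for name, source, _ in files if source == "original"]
```

Here originals separates the user's uploads from the derived and metadata files, the same three groups the directory listing shows.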



Identifier

The globally unique ID of a given item on archive.org.

Collections and items all have a unique identifier

(taken from the Title field of the entry).

item: 251735-autopsy-0001-optimized; collection: opensource_Afrikaans
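
Because the identifier is unique, it is enough to address an item or collection by URL. A small Python 3 sketch, assuming the usual archive.org URL layout (/details/ for the page, /download/ for stored files); the function names are made up for illustration.

```python
from urllib.parse import quote

BASE = "https://archive.org"


def details_url(identifier):
    """URL of the page for an item or a collection."""
    return "{}/details/{}".format(BASE, quote(identifier))


def download_url(identifier, filename):
    """Direct URL of one file stored in an item (assumed layout)."""
    return "{}/download/{}/{}".format(BASE, quote(identifier), quote(filename))


def meta_xml_url(identifier):
    """URL of the item's <identifier>_meta.xml file."""
    return download_url(identifier, identifier + "_meta.xml")


item = "251735-autopsy-0001-optimized"
print(details_url(item))
print(meta_xml_url(item))
print(details_url("opensource_Afrikaans"))
```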

Archive.org and Python


Found this code to open an item's page from the command line (it prints the first 500 characters of the HTML):

 
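# open-webpage.py
# Python 3 version: urllib2 became urllib.request
from urllib.request import urlopen

url = 'https://archive.org/details/dasleidenunsersh00bras'
response = urlopen(url)
webContent = response.read()  # raw bytes of the page
# decode the bytes and show only the first 500 characters
print(webContent.decode('utf-8', errors='replace')[0:500])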


and then this to save the HTML output to a file:

 
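# save-webpage.py
# Python 3 version: urllib2 became urllib.request
from urllib.request import urlopen

url = 'https://archive.org/details/bplill'
response = urlopen(url)
webContent = response.read()  # raw bytes of the page
# write the bytes unchanged; the with block closes the file
with open('bplill.html', 'wb') as f:
    f.write(webContent)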
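
The script above hard-codes both the URL and the output file name. As a hedged sketch (Python 3, where the functionality of urllib2 lives in urllib.request), the file name can be derived from the last part of the details URL instead. The names identifier_from_url and save_page are made up for illustration.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def identifier_from_url(url):
    """Return the last path segment of a details URL, e.g. 'bplill'."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def save_page(url):
    """Fetch url, save it as <identifier>.html, and return the file name."""
    filename = identifier_from_url(url) + ".html"
    with urlopen(url) as response, open(filename, "wb") as f:
        f.write(response.read())
    return filename


# No network access here: only the file name logic is exercised.
print(identifier_from_url("https://archive.org/details/bplill"))
```

Calling save_page('https://archive.org/details/bplill') performs the actual request and writes bplill.html to the current directory.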


internetarchive 0.6.6: a Python interface to archive.org


1) ia, for using the archive from the command line

2) internetarchive, for programmatic access to the archive

Accessing an IA Collection in Python

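To see the number of items in a collection:

 
# internetarchive 0.6.6 API; later releases provide internetarchive.search_items() for this
import internetarchive
search = internetarchive.Search('collection:nasa')
print(search.num_found)  # number of items matching the query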
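
The same count can also be read without the library. A sketch in Python 3, assuming archive.org's advancedsearch.php endpoint and its JSON layout (the count sits under response, numFound). The sample response and the helper names search_url and num_found are hypothetical, and the count in the sample is not a real figure.

```python
import json
from urllib.parse import urlencode


def search_url(query, rows=0):
    """Build an advancedsearch.php URL that asks for JSON output."""
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # return only the identifier field
        ("rows", rows),          # 0 rows: only the count is needed
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


def num_found(response_text):
    """Extract the hit count from the JSON text of a search response."""
    return json.loads(response_text)["response"]["numFound"]


# Hypothetical response body, trimmed to the fields used here.
SAMPLE_RESPONSE = '{"response": {"numFound": 3, "start": 0, "docs": []}}'

print(search_url("collection:nasa"))
print(num_found(SAMPLE_RESPONSE))
```

Fetching the printed URL with urllib.request.urlopen and passing the decoded body to num_found would give the live count.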

Accessing an item

 
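From the command line:

ia download 251735-autopsy-0001-optimized    # download the item's files (or only certain types, e.g. mp4)
ia metadata 251735-autopsy-0001-optimized    # print the item's metadata in the command line

The same download from Python:

# internetarchive 0.6.6 API; later releases provide internetarchive.get_item() for this
import internetarchive
item = internetarchive.Item('251735-autopsy-0001-optimized')
item.download()  # download the item's files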
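
To pick out only certain file types (mp4, for example) before downloading, the item's file list can be filtered first. A sketch in Python 3, assuming the JSON record served at https://archive.org/metadata/<identifier> has a "files" list whose entries carry a "name". The sample record and the function files_with_extension are hypothetical.

```python
def files_with_extension(metadata, extension):
    """Return the names of files in a metadata record with the given extension."""
    suffix = "." + extension.lower().lstrip(".")
    return [
        f["name"]
        for f in metadata.get("files", [])
        if f["name"].lower().endswith(suffix)
    ]


# Hypothetical record, trimmed to the fields used here.
SAMPLE_METADATA = {
    "metadata": {"identifier": "251735-autopsy-0001-optimized"},
    "files": [
        {"name": "example.mp4", "source": "original"},
        {"name": "example.ogv", "source": "derivative"},
        {"name": "251735-autopsy-0001-optimized_meta.xml", "source": "metadata"},
    ],
}

identifier = SAMPLE_METADATA["metadata"]["identifier"]
mp4_files = files_with_extension(SAMPLE_METADATA, "mp4")
for name in mp4_files:
    # Assumed layout of direct file links on archive.org.
    print("https://archive.org/download/{}/{}".format(identifier, name))
```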