User:Francg/expub/thesis/prototype: Difference between revisions

From XPUB & Lens-Based wiki
</center>
</center>


Latest revision as of 14:02, 5 October 2017


Prototype

Extracting data (scraping URLs / web links from content only)
from: https://www.reddit.com/



Run Python (I did it from a virtual environment on my laptop),
then enter the following commands:


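from bs4 import BeautifulSoup
import requests
# Ask for a path to append to the Reddit URL (leave empty for the front page).
# On Python 2, use raw_input instead of input.
url = input("https://www.reddit.com/")
r = requests.get("https://www.reddit.com/" + url)
data = r.text
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(data, "html.parser")
# Print the href attribute of every <a> tag on the page
for link in soup.find_all('a'):
    print(link.get('href'))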
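
The commands above print every href as it appears in the page, so relative links (such as /r/...) and anchors without an href (printed as None) come through unchanged. A minimal sketch of one way to tidy that up follows. It is not the method used above: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages and without a network connection. The names LinkCollector, extract_links and SAMPLE_HTML are hypothetical, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors that have no href (or an empty one)
            # Turn relative links such as "/r/python" into absolute URLs
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up sample page: one relative link, one anchor without href, one absolute link
SAMPLE_HTML = (
    '<a href="/r/python">Python</a>'
    '<a name="top">Top</a>'
    '<a href="https://example.com/">External</a>'
)

for link in extract_links(SAMPLE_HTML, "https://www.reddit.com/"):
    print(link)
```

To use it on a live page, the text returned by requests (r.text in the commands above) could be passed in place of SAMPLE_HTML.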

[Images: Bs4-test-reddit1.png, Bs4-test-reddit1-2.png, Bs4-test-reddit1-3.png]