Python offers a number of powerful, easy-to-use tools for scraping websites. One of the most useful is the Beautiful Soup module.
Web scraping (also called web harvesting or web data extraction) is the practice of programmatically extracting data from websites. Beautiful Soup is one of the most popular Python libraries for the job: it parses HTML and gives you simple methods to pull out the data you want. To get the best out of it, you only need a basic knowledge of HTML. Plenty of other programming languages can scrape the web too, but Python's combination of a standard-library HTTP client and Beautiful Soup makes it a natural fit.
In this article we'll walk through a Beautiful Soup example: a small 'web scraper' that gets data from a Yahoo Finance page about stock options. It's alright if you don't know anything about stock options; the important thing is that the page contains a table of information, shown below, that we'd like to use in our program. Below is a listing for Apple stock options.
First we need to get the HTML source for the page. Beautiful Soup won't download the content for us; we can do that with Python's `urllib` module, one of the libraries that comes standard with Python.

Fetching the Yahoo Finance Page

```python
import urllib.request

optionsUrl = 'http://finance.yahoo.com/q/op?s=AAPL+Options'
optionsPage = urllib.request.urlopen(optionsUrl)
```
This code retrieves the Yahoo Finance HTML and returns a file-like object.
If you go to the page we opened with Python and use your browser's 'view source' command, you'll see that it's a large, complicated HTML file. It will be Python's job to simplify and extract the useful data, using the `BeautifulSoup` module. `BeautifulSoup` is an external module, so you'll have to install it; if you haven't already, you can install it with `pip install beautifulsoup4`.

Beautiful Soup Example: Loading a Page
The following code will load the page into `BeautifulSoup`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(optionsPage, 'html.parser')
```
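`BeautifulSoup` accepts either a file-like object or a plain string, so you can try the parsing step without touching the network at all. A minimal sketch (the markup here is invented for illustration):

```python
import io
from bs4 import BeautifulSoup

# A file-like object standing in for the downloaded page.
fake_page = io.BytesIO(b'<html><body><p>Hello, options!</p></body></html>')
fake_soup = BeautifulSoup(fake_page, 'html.parser')
print(fake_soup.p.get_text())  # Hello, options!

# A plain string works just as well.
cell_soup = BeautifulSoup('<td>1.25</td>', 'html.parser')
print(cell_soup.td.get_text())  # 1.25
```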
Beautiful Soup Example: Searching
Now we can start trying to extract information from the page source (HTML). We can see that the options have pretty unique-looking names in the 'symbol' column, something like `AAPL130328C00350000`. The symbols might be slightly different by the time you read this, but we can solve the problem by using `BeautifulSoup` to search the document for this unique string. Let's search the `soup` variable for this particular option (you may have to substitute a different symbol; just get one from the webpage):

```python
>>> soup.findAll(text='AAPL130328C00350000')
[u'AAPL130328C00350000']
```
This result isn't very useful yet. It's just a Unicode string (that's what the 'u' means) of what we searched for. However, `BeautifulSoup` returns things in a tree format, so we can find the context in which this text occurs by asking for its parent node, like so:

```python
>>> soup.findAll(text='AAPL130328C00350000')[0].parent
<a href='/q?s=AAPL130328C00350000'>AAPL130328C00350000</a>
```
We don't see all the information from the table. Let's try the next level higher.
```python
>>> soup.findAll(text='AAPL130328C00350000')[0].parent.parent
<td><a href='/q?s=AAPL130328C00350000'>AAPL130328C00350000</a></td>
```
And again.
```python
>>> soup.findAll(text='AAPL130328C00350000')[0].parent.parent.parent
<tr><td nowrap='nowrap'><a href='/q/op?s=AAPL&amp;k=110.000000'><strong>110.00</strong></a></td><td><a href='/q?s=AAPL130328C00350000'>AAPL130328C00350000</a></td><td align='right'><b>1.25</b></td><td align='right'><span id='yfs_c63_AAPL130328C00350000'><b style='color:#000000;'>0.00</b></span></td><td align='right'>0.90</td><td align='right'>1.05</td><td align='right'>10</td><td align='right'>10</td></tr>
```
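Because the live Yahoo page changes constantly, here is a self-contained sketch of the same parent-walking idea on a hard-coded fragment (the row below is a simplified, made-up copy of the table markup above):

```python
from bs4 import BeautifulSoup

# A simplified stand-in for one row of the options table.
sample_html = """
<table><tr>
  <td nowrap="nowrap"><a href="/q/op?s=AAPL"><strong>110.00</strong></a></td>
  <td><a href="/q?s=AAPL130328C00350000">AAPL130328C00350000</a></td>
  <td align="right"><b>1.25</b></td>
</tr></table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Find the text node, then climb the tree: text -> <a> -> <td> -> <tr>.
text_node = sample_soup.find_all(string='AAPL130328C00350000')[0]
print(text_node.parent.name)                 # a
print(text_node.parent.parent.name)          # td
print(text_node.parent.parent.parent.name)   # tr
```

Note that newer Beautiful Soup versions prefer the `string=` argument over the older `text=` used in this article; both still work.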
Bingo. It's still a little messy, but you can see all of the data that we need is there. If you ignore all the stuff in brackets, you can see that this is just the data from one row.
```python
optionsTable = [
    [x.text for x in y.parent.contents]
    for y in soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': 'nowrap'})
]
```
This code is a little dense, so let's take it apart piece by piece. The code is a list comprehension within a list comprehension. Let's look at the inner one first:
```python
for y in soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': 'nowrap'})
```
This uses `BeautifulSoup`'s `findAll` function to get all of the HTML elements with a `td` tag, a class of `yfnc_h`, and a `nowrap` of `'nowrap'`. We chose this because it's a unique element in every table entry. If we had just gotten `td`s with the class `yfnc_h`, we would have gotten seven elements per table entry. Another thing to note is that we have to wrap the attributes in a dictionary because `class` is one of Python's reserved words. From the table above it would return this:
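As an aside, current Beautiful Soup versions also accept a `class_` keyword (note the trailing underscore), which avoids the attrs dictionary when you're filtering on class alone. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

row_html = """
<tr>
  <td class="yfnc_h" nowrap="nowrap">110.00</td>
  <td class="yfnc_h">1.25</td>
</tr>
"""
row_soup = BeautifulSoup(row_html, 'html.parser')

# attrs dictionary: 'class' must be quoted because it's a Python keyword.
strike_cells = row_soup.find_all('td', attrs={'class': 'yfnc_h', 'nowrap': 'nowrap'})
print(len(strike_cells))  # 1 -- only the cell that also has nowrap

# class_ keyword: filters on class alone, so it matches both cells.
all_cells = row_soup.find_all('td', class_='yfnc_h')
print(len(all_cells))  # 2
```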
```html
<td nowrap='nowrap'><a href='/q/op?s=AAPL&amp;k=110.000000'><strong>110.00</strong></a></td>
```
We need to get one level higher and then get the text from all of the child nodes of this node's parent. That's what the outer part of the comprehension does: `[x.text for x in y.parent.contents]` climbs up to the parent row and collects the text of every cell in it.
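Putting both comprehensions together, here is the whole extraction as a runnable sketch against a hard-coded two-row fragment (the values are invented; on the real page you would use the `soup` built from the downloaded HTML). It collects cells with `find_all('td')` rather than raw `.contents`, to skip the whitespace text nodes in this pretty-printed sample:

```python
from bs4 import BeautifulSoup

table_html = """
<table>
  <tr><td class="yfnc_h" nowrap="nowrap">110.00</td>
      <td>AAPL130328C00350000</td><td>1.25</td></tr>
  <tr><td class="yfnc_h" nowrap="nowrap">115.00</td>
      <td>AAPL130328C00355000</td><td>1.10</td></tr>
</table>
"""
table_soup = BeautifulSoup(table_html, 'html.parser')

# For each unique strike cell, climb to its row and grab every cell's text.
optionsTable = [
    [td.get_text() for td in y.parent.find_all('td')]
    for y in table_soup.find_all('td', attrs={'class': 'yfnc_h', 'nowrap': 'nowrap'})
]
print(optionsTable)
# [['110.00', 'AAPL130328C00350000', '1.25'],
#  ['115.00', 'AAPL130328C00355000', '1.10']]
```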
This works, but you should be careful if this is code you plan to reuse frequently. If Yahoo changes the way they format their HTML, this could stop working. If you plan to use code like this in an automated way, it would be best to wrap it in a try/except block and validate the output.
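A hedged sketch of that defensive pattern (the URL, timeout, and expected column count here are illustrative assumptions, not values from the article):

```python
import urllib.request
from bs4 import BeautifulSoup

EXPECTED_COLUMNS = 8  # assumption: the table we scrape has eight columns


def scrape_options(url):
    """Fetch and parse the options table; return [] on any failure."""
    try:
        page = urllib.request.urlopen(url, timeout=10)
        page_soup = BeautifulSoup(page, 'html.parser')
        rows = [
            [td.get_text() for td in y.parent.find_all('td')]
            for y in page_soup.find_all('td', attrs={'class': 'yfnc_h'})
        ]
        # Validate before returning: a layout change should fail here,
        # not silently produce malformed rows downstream.
        if not rows or any(len(row) != EXPECTED_COLUMNS for row in rows):
            return []
        return rows
    except Exception:
        # Network errors, parse errors, missing elements: fail soft.
        return []
```

An unreachable host and a reshaped page both come back as an empty list, which the caller can check for explicitly.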
This is only a simple Beautiful Soup example, but it gives you an idea of what you can do with HTML and XML parsing in Python. The Beautiful Soup documentation covers many more tools for searching and navigating HTML documents.