Let’s say you want to find some information inside a web page and display it in a custom way in your app.
This technique is called “scraping.” Let’s also assume you’ve thought through alternatives to scraping web pages from inside your app, and are pretty sure that’s what you want to do.
That raises the question: how can you programmatically dig through the HTML and find the part you’re looking for, in the most robust way possible? Believe it or not, regular expressions won’t cut it!
In this tutorial you’ll find out how! You’ll get hands-on experience with parsing HTML into an Objective-C data model that your apps can use.
In fact, you’ll work with some HTML from this very site, downloading a list of tutorials and also a list of the members of the iOS Tutorial Team (who are quite awesome, if I do say so myself).
Even if you are pretty sure you never want to parse HTML in your apps, you might enjoy this tutorial anyway, because it covers some cool things you can do with XML, such as querying its elements with XPath.
This tutorial assumes some familiarity with Objective-C and iOS programming. If you are a complete beginner, you may wish to check out some of the other tutorials on this site.
Let’s start scraping!
Before you begin parsing/scraping web pages in your app, you should first make sure that this is really the best choice for you.
Scraping web pages from your app is not always the best choice because:
So – assuming you’ve thought this all through, and you’re *really sure* this is what you want to do, here’s how! But don’t say we didn’t warn you ;]
Getting Started: How to Climb Trees
As you’re probably aware, HTML (HyperText Markup Language) is a markup language (it’s in the name!) that tells browsers how to lay out a web page. By its very nature, this content is in a hierarchy that defines where within the page a piece of information is to be displayed.
You may also be aware of XML (eXtensible Markup Language). This also defines a hierarchy of information, and you may at this point be thinking that perhaps HTML is related to XML. You’d be right to think that, and also wrong!
There are two flavors of HTML: XHTML, which is pure XML, and the original, where-it-all-started HTML. You can read about the difference over at Wikipedia, but for the purposes of this tutorial it’s enough to know that HTML is “sort of” an XML document, but with more relaxed rules.
Since an XML document has a natural hierarchy in a tree structure, it makes sense to have some kind of language to describe retrieving portions of that tree. This is where XPath comes in. XPath is a language for selecting portions of an XML document. Fortunately for you, it works just as well with an HTML document.
For example, consider this portion of HTML:
<html>
  <head>
    <title>Some webpage</title>
  </head>
  <body>
    <p class="normal">This is the first paragraph</p>
    <p class="special">This is the second paragraph. <b>This is in bold.</b></p>
  </body>
</html>
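This clearly is in a tree structure that looks like this:
html
├── head
│   └── title
│       └── "Some webpage"
└── body
    ├── p (class="normal")
    │   └── "This is the first paragraph"
    └── p (class="special")
        ├── "This is the second paragraph. "
        └── b
            └── "This is in bold."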
Based on the above diagram, if you wanted to access the title of the HTML document, then you could use the following XPath expression to walk the tree and return the corresponding node:
/html/head/title
This would yield a node with just one child: the text “Some webpage”.
Similarly, if you wanted to access the second paragraph, you could use the following XPath expression:
/html/body/p[@class='special']
This would give you access to the node that represents the portion of the tree underneath <p class="special">. Note that you have used the syntax [@class='special'] to say that you want the nodes which are at html -> body -> p, where the <p> tag has the “class” attribute set to “special”. If there were more than one <p> tag with that class, then this expression would return all of the matching nodes. But in this case, there’s only one.
With that knowledge in hand, you can now write XPath queries to access anything within the tree!
Getting Started for Real: Libxml2 and Hpple
Parsing an XML document into a manageable format is a pretty complex process. But never fear, there is a handy little library that’s included in the iOS SDK called libxml2.
This may sound scary at first. A C library without a pretty Objective-C wrapping?
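To get a feel for what working with that C API involves, here is a rough sketch (not part of the project you're building) that parses an HTML string with libxml2 and collects the text of every node matching an XPath expression. The helper name TextForXPath is made up for illustration, and the sketch assumes your target links against libxml2 and has the libxml2 headers on its header search path.

```objective-c
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper (not from the tutorial project): returns the text
// content of every node in `html` that matches the XPath `query`.
static NSArray *TextForXPath(NSString *html, NSString *query) {
    NSMutableArray *results = [NSMutableArray array];
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];

    // Parse with libxml2's lenient HTML parser rather than the strict XML one.
    htmlDocPtr doc = htmlReadMemory((const char *)data.bytes, (int)data.length,
                                    NULL, "UTF-8",
                                    HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return results;
    }

    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr found = NULL;
    if (context != NULL) {
        found = xmlXPathEvalExpression((const xmlChar *)[query UTF8String], context);
    }

    // Only node-set results carry nodes; the set may be NULL when nothing matched.
    if (found != NULL && found->type == XPATH_NODESET && found->nodesetval != NULL) {
        for (int i = 0; i < found->nodesetval->nodeNr; i++) {
            // xmlNodeGetContent joins the text of the node and its descendants.
            xmlChar *text = xmlNodeGetContent(found->nodesetval->nodeTab[i]);
            if (text != NULL) {
                NSString *string = [NSString stringWithUTF8String:(const char *)text];
                if (string != nil) {
                    [results addObject:string];
                }
                xmlFree(text);
            }
        }
    }

    // Everything libxml2 allocated has to be released by hand.
    if (found != NULL) {
        xmlXPathFreeObject(found);
    }
    if (context != NULL) {
        xmlXPathFreeContext(context);
    }
    xmlFreeDoc(doc);
    return results;
}
```

Called with the sample HTML from earlier and the expression /html/head/title, this should give back a single string, "Some webpage". That's a fair amount of pointer juggling and manual cleanup for one query.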
Fortunately, thanks to some excellent development, there is an open source library called hpple that wraps libxml2 nicely using Objective-C objects. Hpple wraps the creation of the XML document structure, as well as the XPath querying.
While you may feel like you have the hiccups every time you see the word, in this tutorial you will be using hpple to parse HTML.
Start Xcode and go to File\New\Project, select iOS\Application\Master-Detail Application and click Next. Set up the project like so:
- Project name: HTMLParsing
- Company Identifier: Your usual reverse DNS identifier
- Class Prefix: Leave blank
- Device Family: iPhone
- Use Storyboards: No
- Use Automatic Reference Counting: Yes
- Include Unit Tests: No (you’re living life on the edge)
Click Next and finally, choose a location to save your project.
Creating the Data Model
You’re going to be downloading tutorials and contributor names from raywenderlich.com, so it would be nice to have these objects modeled in an Objective-C class for easy access. I know you like to keep your project organized, so create a group in the project called Model under the root HTMLParsing group (right-click on the HTMLParsing folder to get the context menu).
Next create a new file under the Model group by selecting Model, then clicking File\New\File (or right-clicking on the folder and selecting New File…). Select Cocoa Touch\Objective-C class and click Next. Enter “Tutorial” as the class, and make it a subclass of NSObject. Finally, click Next and save it along with the rest of the project.
Now select Tutorial.h and make the interface look like this:
#import <Foundation/Foundation.h>

@interface Tutorial : NSObject
@property (nonatomic, copy) NSString *title;
@property (nonatomic, copy) NSString *url;
@end
Then select Tutorial.m and make the implementation look like this:
#import "Tutorial.h"

@implementation Tutorial
@synthesize title = _title;
@synthesize url = _url;
@end
Now create another class, again under the Model group, and call it “Contributor.” Like before, make it a subclass of NSObject. Then make the interface and implementation look like the following:
// Interface (Contributor.h)
#import <Foundation/Foundation.h>

@interface Contributor : NSObject
@property (nonatomic, copy) NSString *name;
@property (nonatomic, copy) NSString *imageUrl;
@end

// Implementation (Contributor.m)
#import "Contributor.h"

@implementation Contributor
@synthesize name = _name;
@synthesize imageUrl = _imageUrl;
@end
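One detail worth a second look is the copy attribute on those string properties. Here's a small standalone sketch of what it buys you: if the string you assign happens to be an NSMutableString, the model keeps its own snapshot rather than a reference to the mutable buffer. The function name TitleAfterMutatingSource and the sample strings are invented for this illustration, and the Tutorial class is repeated only so the snippet compiles on its own.

```objective-c
#import <Foundation/Foundation.h>

// Same shape as the Tutorial model, repeated so this sketch is self-contained.
@interface Tutorial : NSObject
@property (nonatomic, copy) NSString *title;
@property (nonatomic, copy) NSString *url;
@end

@implementation Tutorial
@synthesize title = _title;
@synthesize url = _url;
@end

// Hypothetical demo: assigns a mutable string to the model, mutates the
// original afterwards, and returns what the model still holds.
static NSString *TitleAfterMutatingSource(void) {
    NSMutableString *buffer = [NSMutableString stringWithString:@"How To Parse HTML"];

    Tutorial *tutorial = [[Tutorial alloc] init];
    tutorial.title = buffer;                  // the copy setter stores a snapshot
    tutorial.url = @"http://www.example.com"; // placeholder URL

    [buffer appendString:@" (draft)"];        // later change to the buffer

    return tutorial.title;                    // unaffected by the append
}
```

Calling TitleAfterMutatingSource() returns "How To Parse HTML". With a strong property instead of copy, the model would point at the same mutable object and see " (draft)" tacked on, which is an easy bug to pick up when you build strings piece by piece while parsing.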