How to Parse HTML on iOS
This is a blog post by iOS Tutorial Team member Matt Galloway, founder of SwipeStack, a mobile development team based in London, UK. You can also find me on Google+. Let’s say you want to find some information inside a web page and display it in a custom way in your app. This technique is […] By Matt Galloway.
Adding the Hpple Code
Note: If you are comfortable with git, then you might want to consider doing the following by cloning the git repository locally, rather than downloading the ZIP file.
The hpple project is hosted on GitHub, so open a browser and point it to https://github.com/topfunky/hpple. Click the “ZIP” button to download a ZIP file containing the project. Unzip it and open the resulting folder in Finder. You should see something like this:
Now create another group under the HTMLParsing group called “hpple,” and drag the TFHpple.h/.m, TFHppleElement.h/.m and XPathQuery.h/.m files to the newly created group. When you do this, make sure that you opt to copy the files to the destination group’s folder and add them to the HTMLParsing target:
Since hpple makes use of libxml2, you need to tell your project where to find the libxml2 headers, and also to link against it when building.
To do this, select the project root at the top of the project navigator, go to Build Settings and search for “header search paths.” Enter the value for the Header Search Paths row as $(SDKROOT)/usr/include/libxml2 and press Enter. It should end up looking like this:
Next select Build Phases and open the “Link Binary With Libraries” section. Click the (+) button and search for libxml2. Select libxml2.dylib (listed as libxml2.tbd in Xcode 7 and later) and press Add. The project navigator should now look like this:
If you build and run the project now, everything should compile and link, and you’ll be presented with the standard app that’s created with the Master-Detail Application template you opted to use:
Sit on Your Arse and Parse
Now that everything is set up, go ahead and parse some HTML! Your first trick will be to parse http://www.raywenderlich.com/tutorials for a list of tutorials. If you open that page in your favorite browser and view its source, you should find something in there like this:
<div class="content-wrapper">
<h3>Beginning iPhone Programming</h3>
<ul>
<li><a href="/?p=1797">How To Create a Simple iPhone App on iOS 5 Tutorial: 1/3</a></li>
<li><a href="/?p=1845">How To Create a Simple iPhone App on iOS 5 Tutorial: 2/3</a></li>
<li><a href="/?p=1888">How To Create a Simple iPhone App on iOS 5 Tutorial: 3/3</a></li>
<li><a href="/?p=10209">My App Crashed – Now What? 1/2</a></li>
<li><a href="/?p=10505">My App Crashed – Now What? 2/2</a></li>
<li><a href="/?p=8003">How to Submit Your App to Apple: From No Account to App Store, Part 1</a></li>
<li><a href="/?p=8045">How to Submit Your App to Apple: From No Account to App Store, Part 2</a></li>
</ul>
</div>
Note: A lot of irrelevant code has been trimmed out for clarity.
If you draw that in tree format, you come up with something like this:
It should be clear that you can obtain all the tutorials by finding all the <a> tags within the <li> tags, which are under <ul> tags, which are in the <div> tag whose class attribute is “content-wrapper”. An XPath expression that obtains these is:
//div[@class='content-wrapper']/ul/li/a
Note: The double slash (//) at the front means “search anywhere in the document for the following tag.” This stops you having to go right from the top of the tree down through html, then body, etc.
Having located all of the <a> tags, you will then be interested in the “href” attributes of the <a> tags, and also the text contents within.
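To experiment with the query before wiring it into the app, a minimal sketch like the following may help. It runs the same XPath expression against any HTML string and collects each link's text and “href”. The function name LinksInContentWrapper and the dictionary keys are hypothetical and not part of the tutorial project; the hpple calls are the same ones the tutorial uses later on.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper for trying out the query; not part of the tutorial project.
// Returns one dictionary per matching <a> tag, holding its text and its href.
static NSArray *LinksInContentWrapper(NSString *html) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    TFHpple *parser = [TFHpple hppleWithHTMLData:data];

    // "//" searches the whole document; the predicate keeps only the <div>
    // whose class attribute is exactly "content-wrapper".
    NSArray *nodes = [parser searchWithXPathQuery:@"//div[@class='content-wrapper']/ul/li/a"];

    NSMutableArray *links = [NSMutableArray array];
    for (TFHppleElement *element in nodes) {
        NSString *title = [[element firstChild] content]; // text inside the <a> tag
        NSString *href = [element objectForKey:@"href"];  // the link target
        if (title != nil && href != nil) {
            [links addObject:@{ @"title" : title, @"href" : href }];
        }
    }
    return links;
}
```

If you pass in a fragment that has one <div class="content-wrapper"> with two links and a second <div> with a different class, only the two links in the first <div> should come back, because the predicate filters out the other one.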
Open MasterViewController.m and add the following imports at the top, since you will need to use these classes later on:
#import "TFHpple.h"
#import "Tutorial.h"
#import "Contributor.h"
Next, add the following method above initWithNibName:bundle:, which will load the list of tutorials from raywenderlich.com:
-(void)loadTutorials {
    // 1
    NSURL *tutorialsUrl = [NSURL URLWithString:@"http://www.raywenderlich.com/tutorials"];
    NSData *tutorialsHtmlData = [NSData dataWithContentsOfURL:tutorialsUrl];

    // 2
    TFHpple *tutorialsParser = [TFHpple hppleWithHTMLData:tutorialsHtmlData];

    // 3
    NSString *tutorialsXpathQueryString = @"//div[@class='content-wrapper']/ul/li/a";
    NSArray *tutorialsNodes = [tutorialsParser searchWithXPathQuery:tutorialsXpathQueryString];

    // 4
    NSMutableArray *newTutorials = [[NSMutableArray alloc] initWithCapacity:0];
    for (TFHppleElement *element in tutorialsNodes) {
        // 5
        Tutorial *tutorial = [[Tutorial alloc] init];
        [newTutorials addObject:tutorial];

        // 6
        tutorial.title = [[element firstChild] content];

        // 7
        tutorial.url = [element objectForKey:@"href"];
    }

    // 8
    _objects = newTutorials;
    [self.tableView reloadData];
}
This might look scary, but let’s break it down and see what’s going on:
1. First you need to download the web page, so you create an NSURL with the appropriate URL string. Then you create an NSData object with the contents of that URL. This means “tutorialsHtmlData” will contain the entire HTML document in raw data form.
   If you wanted, you could create an NSString from this using NSString’s alloc/initWithData:encoding: to see the data. It would be the same as if you were to “view source” in your browser.
   Note: dataWithContentsOfURL: will block until the data has been returned. This means that the UI will become unresponsive until the data is fetched from the server. A better approach is to use NSURLConnection (or NSURLSession, its replacement on newer iOS versions) to asynchronously grab the data, but that’s beyond the scope of this tutorial.
2. Next you create a TFHpple parser with the data that you downloaded.
3. Then you set up the appropriate XPath query and ask the parser to search using the query. This will return an array of nodes (in hpple land, these are TFHppleElement objects).
4. Then you create an array to hold your new tutorial objects and loop over the obtained nodes.
5. Inside the loop, you first create a new Tutorial object and add it to the array.
6. Then you get the tutorial’s title from the node’s first child’s contents. If you look back at the tree, you should be able to see that this is the case.
7. Then you get the tutorial’s URL from the “href” attribute of the node. It’s an <a> tag, so it gives you the linking URL. In our case, this is the tutorial’s URL.
8. Finally you set _objects on the view controller to the new tutorials array you created, and ask the table view to reload its data.
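To look at the downloaded data as text, as the first step suggests, a small sketch along these lines could be used. The function name HTMLStringFromData is hypothetical, and the sketch assumes the page is UTF-8 encoded, which the tutorial does not state.

```objc
#import <Foundation/Foundation.h>

// Hypothetical debugging helper; not part of the tutorial project.
// Assumes the downloaded bytes are UTF-8. Returns nil if there is no data
// or if the bytes cannot be decoded with that encoding.
static NSString *HTMLStringFromData(NSData *data) {
    if (data == nil) {
        return nil;
    }
    return [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
}
```

Inside loadTutorials you could then call NSLog(@"%@", HTMLStringFromData(tutorialsHtmlData)); right after the download to compare the result with “view source” in your browser.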
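The tutorial leaves asynchronous loading out of scope, but for reference, here is a rough sketch of how the blocking download could be avoided. It uses NSURLSession rather than NSURLConnection, which is a substitution on my part, and the function name FetchHTMLData is hypothetical. Treat it as one possible approach under those assumptions, not as the tutorial's method.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper; not part of the tutorial project.
// Downloads `url` without blocking the caller, then calls `completion`
// on the main queue with either the data or an error.
static void FetchHTMLData(NSURL *url, void (^completion)(NSData *data, NSError *error)) {
    NSURLSessionDataTask *task =
        [[NSURLSession sharedSession] dataTaskWithURL:url
                                    completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
            // The session calls this handler off the main thread, so hop back
            // to the main queue before the caller touches any UI.
            dispatch_async(dispatch_get_main_queue(), ^{
                completion(data, error);
            });
        }];
    [task resume]; // tasks start suspended; nothing happens until resume
}
```

In loadTutorials, the dataWithContentsOfURL: line could then be replaced with a call such as FetchHTMLData(tutorialsUrl, ^(NSData *data, NSError *error) { ... }), with steps 2 to 8 moved inside the block and skipped when data is nil. Note that on iOS 9 and later, App Transport Security blocks plain http:// URLs by default, so the URL would need to use https or the app would need an exception in its Info.plist.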
Before you build and run, do some spring cleaning on this class to remove some of the default behavior of the template project. Remove the insertNewObject: method, and change viewDidLoad to look like:
-(void)viewDidLoad {
    [super viewDidLoad];
    [self loadTutorials];
}
Make tableView:cellForRowAtIndexPath: look like this:
-(UITableViewCell *)tableView:(UITableView *)tableView cellForRowAtIndexPath:(NSIndexPath *)indexPath {
    static NSString *CellIdentifier = @"Cell";
    UITableViewCell *cell = [tableView dequeueReusableCellWithIdentifier:CellIdentifier];
    if (cell == nil) {
        cell = [[UITableViewCell alloc] initWithStyle:UITableViewCellStyleSubtitle reuseIdentifier:CellIdentifier];
        cell.accessoryType = UITableViewCellAccessoryDisclosureIndicator;
    }

    Tutorial *thisTutorial = [_objects objectAtIndex:indexPath.row];
    cell.textLabel.text = thisTutorial.title;
    cell.detailTextLabel.text = thisTutorial.url;

    return cell;
}
Here we simply set the main label and the detail label to the tutorial title and URL.
Build and run, and you should be greeted with a list of tutorials!