Crawling the Web

Websites as a source of information

I tried extracting information from HTML pages by writing Python code, and this is what I had to think through to process the data. TL;DR: it mostly worked, but I still had to do some manual editing. I should have written more exploratory code first to find out what I was getting into, before writing the code to do the actual task. Next time.

Also, I'm using the Chrome browser, and if you're not, these instructions will vary.

An example

I decided to get CIA Factbook data from [CIA factbook]. Initially, I considered using a web crawler to get the data. Here's some information on what a web crawler should take into account.

robots.txt

Downloading data from a website is not as easy as it looks. Most websites have a robots.txt file that tells you which areas of the site your software may visit, and which areas it may not. Here's an example of the file from Yahoo:

User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /bin/
Disallow: /includes/
Disallow: /blank.html
Disallow: /_td_api
Disallow: /_tdpp_api
Disallow: /_remote
Disallow: /_multiremote
Disallow: /_tdhl_api
Disallow: /digest
Disallow: /fpjs
Disallow: /myjs

As you can see, there are many parts of the Yahoo website where a robot may not go. Anything that is not disallowed is considered open to web crawlers. If the website wanted to disallow all robots entirely, it would put this in the robots.txt file:

User-agent: *
Disallow: /
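
If you want to honor these rules from Python, the standard library's urllib.robotparser can read robots.txt and answer "may I fetch this URL?" for you. A minimal sketch, assuming the Yahoo file shown above (the answers depend on whatever the live file actually contains):

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt and download it.
rp = RobotFileParser()
rp.set_url('https://www.yahoo.com/robots.txt')
rp.read()

# can_fetch() applies the Allow/Disallow rules for the given user-agent.
print(rp.can_fetch('*', 'https://www.yahoo.com/p/page.html'))  # disallowed above, so False
print(rp.can_fetch('*', 'https://www.yahoo.com/news/'))        # not disallowed, so True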

User-agent

The user-agent string identifies the software making the request. It can easily be spoofed, so it is not treated as a strong safeguard. To find your own user-agent, look in your browser. For Chrome:

1. Open the "Developer Tools"
2. Select the "Network" tab
3. Load Google's home page. You will see a lot of entries scroll by in the Network tab.
4. Select the topmost item in the list (this will be the GET request to google.com).
5. Under "Request Headers", click "view source".
6. Check the "User-Agent" line that is added by your browser. In my case, the user-agent was:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
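
If you write your own crawler, you can send a User-Agent header of your choosing with each request. A small sketch using the requests library; the name and contact address are made up, so substitute something that actually identifies your crawler:

import requests

# Identify the crawler honestly; a contact address makes it easy for
# site owners to reach you instead of blocking you.
headers = {'User-Agent': 'my-factbook-crawler/0.1 (contact: me@example.com)'}

response = requests.get('https://example.com/page.html', headers=headers)
print(response.status_code)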

Delay

Now that you know which parts of a site you may visit, you must make sure that you don't hit it too often. It is better to put a delay between visits, probably something more than a minute. This way, the website owners will not block you from visiting.
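
The simplest way to do this is to sleep between requests. A sketch, assuming the requests library and placeholder URLs; the 90-second pause is just one way to stay comfortably above a minute:

import time
import requests

headers = {'User-Agent': 'my-factbook-crawler/0.1'}
urls = ['https://example.com/page1.html', 'https://example.com/page2.html']

pages = []
for url in urls:
    pages.append(requests.get(url, headers=headers).text)
    time.sleep(90)  # pause between visits so the site isn't hammered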

Other better options

If you can avoid scraping the website, that is the option to take. In my case, the CIA Factbook was available for download [here].

Another option is to use an API if one is available (ex. the Twitter API to download tweets). Then the work you have to do is to understand the API and use it correctly. With sites like Twitter, many people have already used the API successfully and got it to work. Search in your favorite browser for anything you don't understand.
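
The details depend entirely on the API, but the general shape is usually the same: send an authenticated HTTP request and get structured JSON back, so there is no HTML to parse. A rough sketch against a hypothetical endpoint (the URL, token, and parameters are made up; read the real API's documentation):

import requests

# Hypothetical endpoint and token -- the real values come from the API provider.
url = 'https://api.example.com/v1/tweets'
headers = {'Authorization': 'Bearer YOUR_TOKEN_HERE'}

response = requests.get(url, headers=headers, params={'count': 10})
response.raise_for_status()
data = response.json()  # structured data instead of raw HTML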

Understanding HTML

Looking at the HTML code

Now that you have the HTML, you have to extract information from it. One way to see the HTML side-by-side with the web page is:

1. Open the downloaded file in your browser.  You should give the full pathname to your file (ex. file:///Users/YourName/data/file1.html)
2. Open Developer Tools.  This will open a window on the right side of the page.  Make sure the Elements tab is selected on that page.
3. If you're curious about what a particular value on the left side (on the actual page) looks like in HTML, highlight that value, then right-click and select 'Inspect'.  The right side will expand to the correct location.  This beats having to browse through the whole HTML, opening sections until you find the spot.
4. If for some reason you cannot select the item on the left-hand side (the actual page), you can keep clicking the little arrows inside the right window to open sections of the HTML.  As you move your mouse over the HTML, the corresponding selection is highlighted on the left-hand side.  This way, you can narrow down to the section you want to look at.

Understanding basic HTML syntax

HTML tags are written with angle brackets. The name within those brackets is the name of the tag. Here we show two tags (the start and end tags of an element). The end tag's name is prefixed with a '/'.

<body>
</body>

Some tags do not have an end tag. These tags end with a space followed by a '/' after their name. This one is a line break tag, which pushes the text after it onto a new line:

 <br />

BeautifulSoup

Python's BeautifulSoup library lets you process HTML code. The documentation on the official website is really good. I will only go through a small part of what BeautifulSoup can do; you should really read the documentation to get the most out of it.

To process an HTML file, you read it in its entirety as a string and pass it to BeautifulSoup.

from bs4 import BeautifulSoup

with open(filename, 'r') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

Of course, you should put the above in a try/except block. That's a good idea when calling library functions, since you never know how they might change and start raising exceptions as you update your libraries.
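
A sketch of what that could look like; which exceptions you catch, and what you do about them, depends on your program:

from bs4 import BeautifulSoup

try:
    with open(filename, 'r') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
except OSError as err:
    # The file could not be opened or read.
    print(f'Could not read {filename}: {err}')
    soup = None
except Exception as err:
    # Parsing rarely fails, but a library update could change that.
    print(f'Could not parse {filename}: {err}')
    soup = None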

The soup object can be used to find tags with a specific name, or with a specific attribute, or both.
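
For example, assuming the soup object from above and the class value used later in this post:

all_tds = soup.find_all('td')                   # every <td> tag, by name
by_class = soup.find_all(class_='fl_region')    # any tag with class="fl_region"
both = soup.find_all('td', class_='fl_region')  # <td> tags with that class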

Finding data

This is what my HTML looked like:

    <td valign="top" width="200" style="border-right:2px solid white; padding-left:4px;" class="fl_region"><a href="../geos/af.html" style="font-weight: bold;" >Afghanistan</a>


  <td class="category_data"  valign="top" style="padding-left:5px;">
  2.629% (2009 est.) 
    </td>

Yes, I know it's grungy. Maybe this is the norm - most customers don't look at HTML code.

Notice the first tag has a 'class' attribute whose value is 'fl_region'. Its child is an <a> tag whose string is 'Afghanistan'. The second tag has the class 'category_data', and its child is the string '2.629% (2009 est.)'. The second tag is a child of the first tag.

What's more, there is another hidden child between the first tag and the second tag, and that is the '\n' (newline) string.

This code gets all the tags that match the class value, then pulls out the country name from each:

# Find every <td> whose class is 'fl_region', then look through its
# children for the <a> tag that holds the country name.
matching_tds = soup.find_all('td', class_='fl_region')
for td in matching_tds:
    for child in td.children:
        if child.name == 'a' and child.string and len(child.string) > 0:
            country = child.string.strip()

Conclusion

Text can contain whatever its author wants to put in. Since HTML is text, it can differ from tag to tag. In the example above, if the tags were not exactly the same, your code would need to change. If the values were not exactly the same, your code would need to change further.

Having gone through this now, I would change my approach this way:

1. If possible, use an API to get the data.  This puts the onus on the API creator to have the data in a good format.  It puts the onus on you to learn the API.
2. Write code to find out how similar your tags/values are.  Or read through the HTML file yourself to check whether they're consistent.  For a small file, this is fine.  For a large file, or multiple files, you will have to write at least some code.
3. Put the output of your code into a dictionary.  The keys will be the country names (for example), and the values the tags with the data.  The idea is to make the more variable data the key.  Putting it in a dictionary will give you unique keys, thus showing you what variability there is, and how to code for it (see the sketch after this list).
4. Decide how to code for the variability, and go for it.
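
Here is a rough sketch of step 3, assuming the 'fl_region' and 'category_data' tags from the example above appear in matching pairs (that assumption is exactly the kind of thing the dictionary check is meant to confirm):

from bs4 import BeautifulSoup

with open(filename, 'r') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

country_tds = soup.find_all('td', class_='fl_region')
value_tds = soup.find_all('td', class_='category_data')

data_by_country = {}
for country_td, value_td in zip(country_tds, value_tds):
    link = country_td.find('a')
    if link and link.string:
        # The more variable data (the country name) becomes the key;
        # the tag holding the value becomes the dictionary value.
        data_by_country[link.string.strip()] = value_td

# Unique keys show how much variability there is to code for.
print(len(data_by_country), 'countries found')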