Hey, there you are. I'm finally back after more than two months without posting anything. Was I being lazy? No, definitely not. It was probably the busiest summer I've ever had. I carried out several projects, some still in progress and others finished in July and August. Starting today, I'm going to share with you what I did this summer, in five or six posts I suppose.

In this post, I'll show you how I crawled over 600k lines of prices covering nearly every aspect of ordinary Chinese daily life. This is part of an economic research project led by Zhiwei Zhang, Chief Economist & Head of Equity Strategy, who gave me generous guidance throughout the whole project. Due to the NDA I'm not allowed to put the data or results here, but thankfully I own the code, and the code alone is enough.

In this very first post of the series, I'll introduce the structure of a basic crawler script, after which I'll try to collect housing prices from soufang.com, the largest online market for new and second-hand homes in China.

A crawler consists of two parts: a URL (Uniform Resource Locator) getter and an information getter. A URL, better known as a web address, is the first thing we need when we want to collect data from a certain site. Getting it can be easy, especially when we merely flip through pages like below,
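Here is a minimal sketch of such a pager. I assume the page number is encoded in the path as i3&lt;n&gt; (i32 for page 2, i33 for page 3); verify that pattern against the live site before relying on it.

```python
import re

def get_urls(url):
    """Given the URL of the current listing page, return the next one.

    Assumes the page number is embedded in the path as 'i3<n>'
    (i32 = page 2, i33 = page 3, ...) -- a pattern you should
    double-check in the browser before trusting it.
    """
    match = re.search(r"i3(\d+)", url)
    if match is None:
        # the first page carries no marker, so the next one is page 2
        return url.rstrip("/") + "/i32/"
    next_page = int(match.group(1)) + 1
    return re.sub(r"i3\d+", "i3{}".format(next_page), url)
```

Because the pattern lives entirely in the URL string, no web request is needed to compute the next page.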

in which I input the URL of the current page and get the next one from the get_urls function. This works because paging is usually mechanical and the pattern is visible right in the URL. Sometimes, however, a URL getter can be more complicated, e.g. when we want the URLs of the sub-pages for every district in Shanghai. They are not printed on the front page of soufang.com, and hard-coding them shouldn't even be considered: there are 18 districts in Shanghai, 34 provinces in China, and currently 195 countries in the world. No one has the time to write down the URLs for all of them.

So first, we turn to the page we need:

The page looks like this:

Now, open the web inspector.

In the Elements tab, you can see that elements on the page are highlighted according to where your mouse points in the inspector. Below is how I found the URL for Pudong (浦东) district.

It is an a element with an href attribute, so the full URL of the sub-page reads http://esf.sh.fang.com/house-a025. Similarly, we could collect all the URLs one by one. But again, we want to be lazy, so we look at the parent node instead: it carries a class attribute named qxName. All we need to do is collect the href values from the a elements among the children of this qxName node. A second check confirms that qxName is the only element with that class on the page.

I don't want to dwell too much on technical details; for those I strongly suggest you turn to the documentation of Beautiful Soup and Requests. Anyhow, we've arrived at the actual URL getter function I used.
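A sketch of that getter: it takes the already-downloaded HTML rather than a URL (in practice you would fetch the page first, e.g. with requests.get(url, headers=randHeader()).text), and the /house- prefix filter is my assumption for skipping non-district links such as the 不限 ("no filter") entry.

```python
from bs4 import BeautifulSoup

BASE = "http://esf.sh.fang.com"

def get_district_urls(html):
    """Collect the sub-page URL of every district.

    All the district links sit under the single node carrying the
    class 'qxName'; their hrefs are site-relative (e.g. '/house-a025/'),
    so we prepend the site root.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(class_="qxName")
    return [BASE + a["href"]
            for a in container.find_all("a")
            if a.get("href", "").startswith("/house-")]
```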

Similarly, we can write a URL getter for the different CBDs within a district.
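A hedged sketch of the CBD version: the container class for the CBD links has to be read off the inspector just like qxName was, so it's left as a parameter here rather than hard-coded.

```python
from bs4 import BeautifulSoup

def get_sub_urls(html, class_name, base="http://esf.sh.fang.com"):
    """Generic version of the district getter.

    Collect every href under the node carrying `class_name` and make
    it absolute. The actual class used for CBD links must be found
    with the inspector, exactly as we did for 'qxName'.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(class_=class_name)
    if container is None:  # class not present on this page
        return []
    return [base + a["href"]
            for a in container.find_all("a")
            if a.get("href")]
```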

Now the first part is done. The next step is to collect the information we need: this time, the average prices and the numbers of deals in the past month. On the page it looks like below:

while in the inspector, it's contained in a p element with a class attribute of setNum. So the rest is easy:
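A sketch of that information getter, assuming only that both figures sit inside p elements of class setNum; since I'm not reproducing the exact inner markup, it simply strips the first run of digits out of each element's text.

```python
import re
from bs4 import BeautifulSoup

def get_info(html):
    """Pull the average price and the deal count out of the page.

    Both numbers live inside <p class="setNum"> elements; we take
    the first run of digits in each element's text, which covers
    strings like '52000元/平方米' and '上月成交123套' alike.
    """
    soup = BeautifulSoup(html, "html.parser")
    numbers = []
    for p in soup.find_all("p", class_="setNum"):
        match = re.search(r"\d+", p.get_text())
        if match:
            numbers.append(int(match.group()))
    return numbers
```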

Is that all? I'm afraid not. You may have noticed that I used a function named randHeader but never defined it. It generates a random header for each web request. We need random headers (and random proxies, which will be discussed in future posts) to disguise our requests as those of a real web surfer using, say, either a Macintosh or a Windows machine set to Chinese. The function reads as below:
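Something along these lines works; the User-Agent strings below are just illustrative examples, and any pool of real browser strings will do.

```python
import random

def randHeader():
    """Build a request header with a User-Agent drawn at random,
    so that successive requests don't all look identical.
    """
    user_agents = [
        # Windows / Chrome
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36",
        # macOS / Safari
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 "
        "(KHTML, like Gecko) Version/13.1 Safari/605.1.15",
        # Windows / Firefox
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) "
        "Gecko/20100101 Firefox/74.0",
    ]
    return {
        "User-Agent": random.choice(user_agents),
        "Accept-Language": "zh-CN,zh;q=0.9",  # pose as a Chinese-locale browser
    }
```

Pass the result to each request, e.g. requests.get(url, headers=randHeader()).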

Once again, I strongly suggest you turn to the documentation of the individual packages rather than blogs like mine for technical issues. Docs answer the how-to questions; blogs answer the why-to ones. All I want to share here is why I wrote my crawler this way, and hopefully, after this tutorial, my readers will be able to write their own simple crawlers as fast as I can.

Thanks for reading, and below is the whole script.