Introduction To URL Rewriting

Many Web companies spend hours and hours agonizing over the best domain names for their clients. They try to find a domain name that is relevant and appropriate, sounds professional yet is distinctive, is easy to spell and remember and read over the phone, looks good on business cards and is available as a dot-com.

Or else they spend thousands of dollars to purchase the one they really want, which just happened to be registered by a forward-thinking and hard-to-find squatter in 1998.

They go through all that trouble with the domain name but neglect the rest of the URL, the element after the domain name. It, too, should be relevant, appropriate, professional, memorable, easy to spell and readable. And for the same reasons: to attract customers and improve in search ranking.

Fortunately, there is a technique called URL rewriting that can turn unsightly URLs into nice ones — with a lot less agony and expense than picking a good domain name. It enables you to fill out your URLs with friendly, readable keywords without affecting the underlying structure of your pages.

This article covers the following:

  1. What is URL rewriting?
  2. How can URL rewriting help your search rankings?
  3. Examples of URL rewriting, including regular expressions, flags and conditionals;
  4. URL rewriting in the wild, such as on Wikipedia, WordPress and shopping websites;
  5. Creating friendly URLs;
  6. Changing page names and URLs;
  7. Checklist and troubleshooting.

What Is URL Rewriting?

If you were writing a letter to your bank, you would probably open your word processor and create a file named something like lettertobank.doc. The file might sit in your Documents directory, with a full path like C:\Windows\users\julie\Documents\lettertobank.doc. One file path = one document.

Similarly, if you were creating a banking website, you might create a page named page1.html, upload it, and then point your browser to http://www.mybanksite.com/page1.html. One URL = one resource. In this case, the resource is a physical Web page, but it could be a page or product drawn from a CMS.

URL rewriting changes all that. It allows you to completely separate the URL from the resource. With URL rewriting, you could have http://www.mybanksite.com/aboutus.html taking the user to …/page1.html or to …/about-us/ or to …/about-this-website-and-me/ or to …/youll-never-find-out-about-me-hahaha-Xy2834/. Or to all of these. It’s a bit like shortcuts or symbolic links on your hard drive. One URL = one way to find a resource.

With URL rewriting, the URL and the resource that it leads to can be completely independent of each other. In practice, they’re usually not wholly independent: the URL usually contains some code or number or name that enables the CMS to look up the resource. But in theory, this is what URL rewriting provides: a complete separation.

How Does URL Rewriting Help?

Can you guess what this Web page sells?

http://www.diy.com/diy/jsp/bq/nav.jsp?action=detail&fh_secondid=11577676

B&Q went to all the trouble and expense of acquiring diy.com and implementing a stock-controlled e-commerce website, but left its URLs indecipherable. If you guessed “brown guttering,” you might want to consider playing the lottery.

Even when you search directly for “miniflow gutter brown” on Google UK, B&Q’s page comes up only seventh in the organic search results, below much smaller companies, such as a building supplier with a single outlet in Stirlingshire. B&Q has 300+ branches and so is probably much bigger in budget, size and exposure, so why is it not doing as well for this search term? Perhaps because the other search results have URLs like http://www.prof…co.uk/products/brown-miniflo-gutter-148/; that is, the URL itself contains the words in the search term.

screenshot

Almost all of these results on Google have the search term in their URLs (highlighted in green). The one at the bottom does not.

Looking at the URL from B&Q, you would (probably correctly) assume that a file named nav.jsp within the directory /diy/jsp/bq/ is used to display products when given their ID number, 11577676 in this case. That is the resource intimately tied to this URL.

So, how would B&Q go about turning this into something more recognizable, like http://www.diy.com/products/miniflow-gutter-brown/11577676, without restructuring its whole website? The answer is URL rewriting.

Another way to look at URL rewriting is like a thin layer that sits on top of a website, translating human- and search-engine-friendly URLs into actual URLs. Doing it is easy because it requires hardly any changes to the website’s underlying structure — no moving files around or renaming things.

URL rewriting basically tells the Web server that
/products/miniflow-gutter-brown/11577676 should show the Web page at: /diy/jsp/bq/nav.jsp?action=detail&fh_secondid=11577676,
without the customer or search engine knowing about it.

Many factors (or “signals”), of course, determine the search ranking for a particular term, over 200 of them according to Google. But friendly and readable URLs are consistently ranked as one of the most important of those factors. They also help humans to quickly figure out what a page is about.

The next section describes how this is done.

How To Rewrite URLs

Whether you can implement URL rewriting on a website depends on the Web server. Apache usually comes with the URL rewriting module, mod_rewrite, already installed. The set-up is very common and is the basis for all of the examples in this article. ISAPI Rewrite is a similar module for Windows IIS but requires payment (about $100 US) and installation.

The Simplest Case

The simplest case of URL rewriting is to rename a single static Web page, and this is far easier than the B&Q example above. To use Apache’s URL rewriting function, you will need to create or edit the .htaccess file in your website’s document root (or, less commonly, in a subdirectory).

For instance, if you have a Web page about horses named Xu8JuefAtua.htm, you could add these lines to .htaccess:

RewriteEngine On
RewriteRule   horses.htm   Xu8JuefAtua.htm

Now, if you visit http://www.mywebsite.com/horses.htm, you’ll actually be shown the Web page Xu8JuefAtua.htm. Furthermore, your browser will remain at horses.htm, so visitors and search engines will never know that you originally gave the page such a cryptic name.

Introducing Regular Expressions

In URL rewriting, you need only match the path of the URL, not including the domain name or the first slash. The rule above essentially tells Apache that if the path contains horses.htm, then show the Web page Xu8JuefAtua.htm. This is slightly problematic, because you could also visit http://www.mywebsite.com/reallyfasthorses.html, and it would still work. So, what we really need is this:

RewriteEngine On
RewriteRule   ^horses.htm$   Xu8JuefAtua.htm

The ^horses.htm$ is not just a search string, but a regular expression, in which special characters (such as ^ . + * ? | ( ) [ ] { } and $) have extra significance. The ^ matches the beginning of the URL’s path, and the $ matches the end. This says that the path must begin and end with horses.htm. So, only horses.htm will work, and not reallyfasthorses.htm or horses.html. (Strictly speaking, the unescaped dot matches any single character, so ^horses\.htm$ is the more precise pattern.) This is important for search engines like Google, which can penalize what they view as duplicate content: identical pages that can be reached via multiple URLs.
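The difference between the unanchored and anchored patterns can be tried out away from the server. The sketch below uses Python's re module purely as a stand-in for mod_rewrite's matching; it is not how Apache evaluates rules, and the list of test paths is made up for illustration.

```python
import re

# Paths as a rule in .htaccess would see them: no domain, no leading slash.
paths = ["horses.htm", "reallyfasthorses.html", "horses.html", "horsesXhtm"]

unanchored = re.compile(r"horses.htm")     # matches anywhere; the dot matches any character
anchored = re.compile(r"^horses\.htm$")    # the whole path, with a literal dot

for path in paths:
    print(path, bool(unanchored.search(path)), bool(anchored.search(path)))
```

Only horses.htm satisfies the anchored pattern, while all four paths satisfy the unanchored one.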

Without File Endings

You can make this even better by ditching the file ending altogether, so that you can visit either http://www.mywebsite.com/horses or http://www.mywebsite.com/horses/:

RewriteEngine On
RewriteRule   ^horses/?$   Xu8JuefAtua.htm  [NC]

The ? indicates that the preceding character is optional. So, in this case, the URL would work with or without the slash at the end. These would not be considered duplicate URLs by a search engine, but would help prevent confusion if people (or link checkers) accidentally added a slash. The stuff in brackets at the end of the rule gives Apache some further pointers. [NC] is a flag that means that the rule is case insensitive, so http://www.mywebsite.com/HoRsEs would also work.
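As a rough check of what the optional slash and the [NC] flag allow, here is a small Python sketch. The re.IGNORECASE flag plays the role of [NC]; the helper name matches is hypothetical and exists only for this illustration.

```python
import re

# Stand-in for the pattern ^horses/?$ with the [NC] flag.
rule = re.compile(r"^horses/?$", re.IGNORECASE)

def matches(path):
    """Return True when the path would trigger the rule."""
    return rule.search(path) is not None

for candidate in ["horses", "horses/", "HoRsEs", "horses//", "horses.htm"]:
    print(candidate, matches(candidate))
```

The first three candidates match; a double slash or a file ending does not.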

Wikipedia Example

We can now look at a real-world example. Wikipedia appears to use URL rewriting, passing the title of the page to a PHP file. For instance…

http://en.wikipedia.org/wiki/Barack_obama

… is rewritten to:

http://en.wikipedia.org/w/index.php?title=Barack_obama

This could well be implemented with an .htaccess file, like so:

RewriteEngine On
#Look for the word "wiki" followed by a slash, and then the article title
RewriteRule   ^wiki/(.+)$   w/index.php?title=$1   [L]

The previous rule had /?, which meant zero or one slashes. If it had said /+, it would have meant one or more slashes, so even http://www.mywebsite.com/horses//// would have worked. In this rule, the dot (.) matches any character, so .+ matches one or more of any character, which is essentially anything. And the parentheses, ( ), ask Apache to remember what the .+ matched. The rule above, then, tells Apache to look for wiki/ followed by one or more of any character and to remember what it is. Whatever was remembered is then inserted wherever $1 appears in the substitution. So, when the rewriting is finished, wiki/Barack_obama becomes w/index.php?title=Barack_obama.

Thus, the page w/index.php is called, passing Barack_obama as a parameter. The w/index.php is probably a PHP page that runs a database lookup — like SELECT * FROM articles WHERE title='Barack obama' — and then outputs the HTML.
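The capture-and-substitute step can be sketched in a few lines of Python. This only imitates the rule's pattern and its $1 back-reference; the function name rewrite_wiki is an assumption made for the example, and nothing here reflects how Wikipedia is actually configured.

```python
import re

def rewrite_wiki(path):
    """Imitate ^wiki/(.+)$ -> w/index.php?title=$1; return None if there is no match."""
    m = re.search(r"^wiki/(.+)$", path)
    if m is None:
        return None
    # m.group(1) is what the parentheses "remembered", i.e. $1 in the rule.
    return "w/index.php?title=" + m.group(1)

print(rewrite_wiki("wiki/Barack_obama"))
```

Because .+ needs at least one character, a bare wiki/ is left alone.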

screenshot

You can also view Wikipedia entries directly, without the URL rewriting.

Comments and Flags

The example above also introduced comments. Anything after a # is ignored by Apache, so it’s a good idea to explain your rewriting rules so that future generations can understand them. The [L] flag means that if this rule matches, Apache can stop now. Otherwise, Apache would continue applying subsequent rules, which is a powerful feature but unnecessary for all but the most complex rule sets.

Implementing the B&Q Example

The recommendation for B&Q above could be implemented with an .htaccess file, like so:

RewriteEngine On
#Look for the word "products" followed by slash, product title, slash, id number
RewriteRule  ^products/.*/([0-9]+)$   diy/jsp/bq/nav.jsp?action=detail&fh_secondid=$1 [NC,L]

Here, the .* matches zero or more of any character, so nothing or anything. And the [0-9] matches a single numerical digit, so [0-9]+ matches one or more digits.
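To see which paths this pattern accepts and which part it keeps, here is a Python imitation of the rule. It is a sketch only: rewrite_product is a hypothetical helper, and re.IGNORECASE stands in for [NC].

```python
import re

PATTERN = re.compile(r"^products/.*/([0-9]+)$", re.IGNORECASE)

def rewrite_product(path):
    """Return the internal path for a friendly product URL, or None if the rule does not match."""
    m = PATTERN.search(path)
    if m is None:
        return None
    # Only the digits are captured; the product title matched by .* is discarded.
    return "diy/jsp/bq/nav.jsp?action=detail&fh_secondid=" + m.group(1)

print(rewrite_product("products/miniflow-gutter-brown/11577676"))
```

Since the title is matched by .* and then thrown away, any words (or none at all) in that position lead to the same product page.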

The next section covers a couple of more complex conditional examples. You can also read the Apache rewriting guide for much more information on all that URL rewriting has to offer.

Conditional Rewriting

URL rewriting can also include conditions and make use of environment variables. These two features make for an easy way to redirect requests from one domain alias to another. This is especially useful if a website changes its domain, from mywebsite.co.uk to mywebsite.com for example.

Domain Forwarding

Most domain registrars allow for domain forwarding, which redirects all requests from one domain to another domain, but which might send requests for www.mywebsite.co.uk/horses to the home page at www.mywebsite.com and not to www.mywebsite.com/horses. You can achieve this with URL rewriting instead:

RewriteEngine On
RewriteCond   %{HTTP_HOST}   !^www\.mywebsite\.com$       [NC]
RewriteRule   (.*)           http://www.mywebsite.com/$1  [L,R=301]

The second line in this example is a RewriteCond, rather than a RewriteRule. It is used to compare an Apache environment variable on the left (such as the host name in this case) with a regular expression on the right. Only if this condition is true will the rule on the next line be considered.

In this case, %{HTTP_HOST} represents www.mywebsite.co.uk, the host (i.e. domain) that the browser is trying to visit. The ! means “not.” This tells Apache, if the host does not begin and end with www.mywebsite.com, then remember and rewrite zero or more of any character to www.mywebsite.com/$1. This converts www.mywebsite.co.uk/anything-at-all to www.mywebsite.com/anything-at-all. And it will work for all other aliases as well, like www.mywebsite.biz/anything-at-all and mywebsite.com/anything-at-all.

The flag [R=301] is very important. It tells Apache to do a 301 (i.e. permanent) redirect. Apache will send the new URL back to the browser or search engine, and the browser or search engine will have to request it again. Unlike all of the examples above, the new URL will now appear in the browser’s location bar. And search engines will take note of the new URL and update their databases. [R] by itself is the same as [R=302] and signifies a temporary redirect.
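The logic of the condition and the redirect can be modelled as a small Python function. This is an illustrative sketch, not Apache's implementation: redirect_for is a made-up name, and the path is passed without its leading slash, as a rule in .htaccess would see it.

```python
import re

CANONICAL_HOST = "www.mywebsite.com"

def redirect_for(host, path):
    """Return (status, location) when a redirect is due, or None to serve the page as usual."""
    # Mirrors the negated condition: act only if the host is NOT the canonical one.
    if re.search(r"^www\.mywebsite\.com$", host, re.IGNORECASE):
        return None
    return 301, "http://" + CANONICAL_HOST + "/" + path

print(redirect_for("www.mywebsite.co.uk", "horses"))
print(redirect_for("www.mywebsite.com", "horses"))
```

Every alias, including the bare mywebsite.com, ends up on the same path at the canonical host.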

File Existence and WordPress

Smashing Magazine runs on the popular blogging software WordPress. WordPress enables the author to choose their own URL, called a “slug.” Then, it automatically prepends the date, such as http://coding.smashingmagazine.com/2011/09/05/getting-started-with-the-paypal-api/. In your pre-URL rewriting days, you might have assumed that Smashing Magazine’s Web server was actually serving up a file located at …/2011/09/05/getting-started-with-the-paypal-api/index.html. In fact, WordPress uses URL rewriting extensively.

screenshot

WordPress enables the author to choose their own URL for an article.

WordPress’ .htaccess file looks like this:

RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]

The -f means “this is a file” and -d means “this is a directory.” This tells Apache, if the requested file name is not a file, and the requested file name is not a directory, then rewrite everything (i.e. any path containing any character) to the page index.php. If you are requesting an existing image or the log-in page wp-login.php, then the rule is not triggered. But if you request anything else, like /2011/09/05/getting-started-with-the-paypal-api/, then the file index.php jumps into action.

Internally, index.php (probably) looks at the server variable $_SERVER['REQUEST_URI'] and extracts the information that it needs to find out what it is looking for. This gives it even more flexibility than Apache’s rewrite rules and enables WordPress to mimic some very sophisticated URL rewriting rules. In fact, when administering a WordPress blog, you can go to Settings → Permalinks on the left side, and choose the type of URL rewriting that you would like to mimic.
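The "send everything that is not a real file to one script" pattern can be sketched as follows. This is a guess at the general shape, not WordPress's actual code: the dispatch function, the set of existing paths and the /YYYY/MM/DD/slug/ permalink pattern are all assumptions made for illustration.

```python
import re

# Hypothetical permalink structure: /YYYY/MM/DD/slug/
PERMALINK = re.compile(r"^/(\d{4})/(\d{2})/(\d{2})/([^/]+)/?$")

def dispatch(request_uri, existing):
    """Mimic the two file checks, then parse the URI the way a front controller might."""
    path = request_uri.split("?", 1)[0]
    if path in existing:                  # stands in for the -f and -d tests
        return ("static", path)
    m = PERMALINK.match(path)
    if m:
        year, month, day, slug = m.groups()
        return ("post", {"year": int(year), "month": int(month),
                         "day": int(day), "slug": slug})
    return ("not_found", path)

print(dispatch("/2011/09/05/getting-started-with-the-paypal-api/", {"/wp-login.php"}))
```

Existing files are served untouched; everything else is interpreted by the script, which is free to support whatever URL scheme it likes.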

screenshot

WordPress’ permalink settings, letting you choose the type of URL rewriting that you would like to mimic.

Rewriting Query Strings

If you are hired to recreate an existing website from scratch, you might use URL rewriting to redirect the 20 most popular URLs on the old website to the locations on the new website. This could involve redirecting things like prod.php?id=20 to products/great-product/2342, which itself gets redirected to the actual product page.

Apache’s RewriteRule applies only to the path in the URL, not to parameters like id=20. To do this type of rewriting, you will need to refer to the Apache environment variable %{QUERY_STRING}. This can be accomplished like so:

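RewriteEngine On
#Old URL prod.php?id=20: redirect permanently; the trailing ? drops the old query string
RewriteCond   %{QUERY_STRING}           ^id=20$
RewriteRule   ^prod\.php$               /products/great-product/2342?   [L,R=301]
#New URL: pass the numeric ID (the second captured group) to the real page
RewriteRule   ^products/(.*)/([0-9]+)$  productview.php?id=$2           [L]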

In this example, the first RewriteRule triggers a permanent redirect from the old website’s URL to the new website’s URL. The second rule rewrites the new URL to the actual PHP page that displays the product.
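The two steps can be traced with a small Python model. It is a sketch under stated assumptions: handle is a hypothetical function, the path and query string are passed separately (much as Apache keeps them separate), and the numeric ID is taken from the second captured group.

```python
import re

def handle(path, query=""):
    """Return what the two rules would do with a request."""
    # Step 1: the old URL with id=20 triggers a visible, permanent redirect.
    if re.search(r"^id=20$", query) and re.search(r"^prod\.php$", path):
        return ("redirect", 301, "/products/great-product/2342")
    # Step 2: the friendly URL is rewritten internally; group 2 holds the ID.
    m = re.search(r"^products/(.*)/([0-9]+)$", path)
    if m:
        return ("rewrite", "productview.php?id=" + m.group(2))
    return ("pass", path)

print(handle("prod.php", "id=20"))
print(handle("products/great-product/2342"))
```

A request for prod.php with any other ID falls through both steps unchanged, so each old URL needs its own condition and rule.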

Examples Of URL Rewriting On Shopping Websites

For complex content-managed websites, there is still the issue of how to map friendly URLs to underlying resources. The simple examples above did that mapping by hand, manually associating a URL like horses.htm with the file or resource Xu8JuefAtua.htm. Wikipedia looks up the resource based on the title, and WordPress applies some complex internal rule sets. But what if your data is more complex, with thousands of products in hundreds of categories? This section shows the approach that Amazon and many other shopping websites take.

If you’ve ever come across a URL like this on Amazon, http://www.amazon.co.uk/High-Voltage-AC-DC/dp/B00008AJL3, you might have assumed that Amazon’s website has a subdirectory named /High-Voltage-AC-DC/dp/ that contains a file named B00008AJL3.

This is very unlikely. You could try changing the name of the top-level “directory” and you would still arrive on the same page, http://www.amazon.co.uk/Test-Voltage-AC-DC/dp/B00008AJL3.

The bit at the end is what really matters. Looking down the page, you’ll see that B00008AJL3 is this AC/DC album’s ASIN (Amazon Standard Identification Number). If you change that, you’ll get a “Page not found” or an entirely different product: http://www.amazon.co.uk/High-Voltage-AC-DC/dp/B003BEZ7HI.

The /dp/ also matters. Changing this leads to a “Page not found.” So, the B00008AJL3 probably tells Amazon what to display, and the dp tells the website how to display it. This is URL rewriting in action, with the original URL possibly ending up getting rewritten to something like:
http://www.amazon.co.uk/displayproduct.php?asin=B00008AJL3.

Features of an Amazon URL

This introduces some important features of Amazon’s URLs that can be applied to any website with a complex set of resources. It shows that the URL can be automatically generated and can include up to three parts:

  1. The words. In this case, the words are based on the album and artist, and all non-alphanumeric characters are replaced. So, the slash in AC/DC becomes a hyphen. This is the bit that helps humans and search engines.
  2. An ID number, or something that tells the website what to look up, such as B00008AJL3.
  3. An identifier, or something that tells the website where to look for it and how to display it. If dp tells Amazon to look for a product, then somewhere along the line, it probably triggers a database statement such as SELECT * FROM products WHERE id='B00008AJL3'.
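The three parts listed above can be pulled out of a URL with a short Python sketch. It assumes the simple words/identifier/ID shape seen in the example URL and says nothing about how Amazon really parses its URLs; split_product_url is a hypothetical helper.

```python
from urllib.parse import urlparse

def split_product_url(url):
    """Split /<words>/<identifier>/<id> into its parts; return None for any other shape."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 3:
        return None
    words, identifier, item_id = parts
    return {"words": words, "identifier": identifier, "id": item_id}

print(split_product_url("http://www.amazon.co.uk/High-Voltage-AC-DC/dp/B00008AJL3"))
```

A lookup that relies only on the identifier and the ID is unaffected when the words change, which is why the altered Test-Voltage URL still reaches the same page.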

Other Shopping Examples

Many other shopping websites have URLs like this. In the list below, the ID number and (suspected) identifier are in bold:

  • http://www.ebay.co.uk/itm/Ian-Rankin-Set-Darkness-Rebus-Novel-/140604842997
  • http://www.kelkoo.com/c-138201-lighting/brand/caravan
  • http://www.ciao.co.uk/Fridge_Freezers_5266430_3
  • http://www.gumtree.com/p/for-sale/boys-bmx-bronx-blaze/97669042
  • http://www.comet.co.uk/c/Televisions/LCD-Plasma-LED-TVs/1844

A significant benefit of this type of URL is that the actual words can be changed, as shown below. As long as the ID number stays the same, the URL will still work. So products can be renamed without breaking old links. More sophisticated websites (like Ciao above) will redirect the changed URL back to the real one and thus avoid creating the appearance of duplicate content (see below for more on this topic).

screenshot

Websites that use URL rewriting are more flexible with their URLs — the words can change but the page will still be found.

Friendly URLs

Now you know how to map nice friendly URLs to their underlying Web pages, but how should you create those friendly URLs in the first place?

If we followed the current advice, we would separate words with hyphens rather than underscores and capitalize consistently. Lowercase might be preferable because most people search in lowercase. Punctuation such as dots and commas should also be turned into hyphens, otherwise they would get turned into things like %2C, which look ugly and might break the URL when copied and pasted. You might want to remove apostrophes and parentheses entirely for the same reason.

Whether to replace accented characters is debatable. URLs with accents (or any non-Roman characters) might look bad or break when rendered in a different character format. But replacing them with their non-accented equivalents might make the URLs harder for search engines to find (and even harder if replaced with hyphens). If your website is for a predominately French audience, then perhaps leave the French accents in. But substitute them if the French words are few and far between on a mainly English website.

This PHP function succinctly handles all of the above suggestions:

function GenerateUrl ($s) {
  //Convert accented characters, and remove parentheses and apostrophes
  $from = explode (',', "ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,e,i,ø,u,(,),[,],'");
  $to = explode (',', 'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,e,i,o,u,,,,,,');
  //Do the replacements, and convert all other runs of non-alphanumeric characters to hyphens
  $s = preg_replace ('~[^\w\d]+~', '-', str_replace ($from, $to, trim ($s)));
  //Remove a - at the beginning or end and make lowercase
  return strtolower (preg_replace ('/^-/', '', preg_replace ('/-$/', '', $s)));
}

This would generate URLs like this:

echo GenerateUrl ("Pâtisserie (Always FRESH!)"); //returns "patisserie-always-fresh"

Or, if you wanted a link to a $product variable to be pulled from a database:

$product = array ('title'=>'Great product', 'id'=>100);
echo '<a href="' . GenerateUrl ($product['title']) . '/' . $product['id'] . '">';
echo $product['title'] . '</a>';

Changing Page Names

Search engines generally ignore duplicate content (i.e. multiple pages with the same information). But if they think they are being manipulated, search engines will actively penalize the website, so avoid this where possible. Google recommends using 301 redirects to send users from old pages to new ones.

When a URL-rewritten page is renamed, the old URL and new URL should both still work. Furthermore, to avoid any risk of duplication, the old URL should automatically redirect to the new one, as WordPress does.

Doing this in PHP is relatively easy. The following function looks at the current URL, and if it’s not the same as the desired URL, it redirects the user:

function CheckUrl ($s) {
  // Get the current URL without the query string, with the initial slash
  $myurl = preg_replace ('/\?.*$/', '', $_SERVER['REQUEST_URI']);
  //If it is not the same as the desired URL, then redirect
  if ($myurl != "/$s") {Header ("Location: /$s", true, 301); exit;}
}

This would be used like so:

$producturl = GenerateUrl ($product['title']) . '/' . $product['id'];
CheckUrl ($producturl); //redirects the user if they are at the wrong place

If you would like to use this function, be sure to test it in your environment first and with your rewrite rules, to make sure that it does not cause any infinite redirects. This is what that would look like:

screenshot

This is what happens when Google Chrome visits a page that redirects to itself.

Checklist And Troubleshooting

Use the following checklist to implement URL rewriting.

1. Check That It’s Supported

Not all Web servers support URL rewriting. If you put up your .htaccess file on one that doesn’t, it will be ignored or will throw up a “500 Internal Server Error.”

2. Plan Your Approach

Figure out what will get mapped to what, and how the correct information will still get found. Perhaps you want to introduce new URLs, like my-great-product/p/123, to replace your current product URLs, like product.php?id=123, and to substitute new-category/c/12 for category.php?id=12.

3. Create Your Rewrite Rules

Create an .htaccess file for your new rules. You can initially do this in a /testing/ subdirectory and using the [R] flag, so that you can see where things go:

RewriteEngine On
RewriteRule   ^.+/p/([0-9]+)   product.php?id=$1    [NC,L,R]
RewriteRule   ^.+/c/([0-9]+)   category.php?id=$1    [NC,L,R]

Now, if you visit www.mywebsite.com/testing/my-great-product/p/123, you should be sent to www.mywebsite.com/testing/product.php?id=123. You’ll get a “Page not found” because product.php is not in your /testing/ subdirectory, but at least you’ll know that your rules work. Once you’re satisfied, move the .htaccess file to your document root and remove the [R] flag. Now www.mywebsite.com/my-great-product/p/123 should work.
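Before uploading anything, the planned mappings can also be checked offline. The sketch below imitates the two test rules with Python's re module (re.IGNORECASE in place of [NC]); the RULES and EXPECTED tables and the rewrite helper are hypothetical, and passing this check does not replace testing on the real server.

```python
import re

# Python stand-ins for the two test rules above.
RULES = [
    (re.compile(r"^.+/p/([0-9]+)", re.IGNORECASE), "product.php?id={}"),
    (re.compile(r"^.+/c/([0-9]+)", re.IGNORECASE), "category.php?id={}"),
]

def rewrite(path):
    """Return the target of the first matching rule, or None."""
    for pattern, target in RULES:
        m = pattern.search(path)
        if m:
            return target.format(m.group(1))
    return None

# Planned mappings, written down before touching the live website.
EXPECTED = {
    "my-great-product/p/123": "product.php?id=123",
    "new-category/c/12": "category.php?id=12",
    "p/123": None,  # .+ needs at least one character before /p/
}

for path, want in EXPECTED.items():
    print(path, rewrite(path), "ok" if rewrite(path) == want else "MISMATCH")
```

Keeping the expected mappings in one table makes it easy to rerun the check whenever a rule changes.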

4. Check Your Pages

Test that your new URLs bring in all the correct images, CSS and JavaScript files. For example, the Web browser now believes that your Web page is named 123 in a directory named my-great-product/p/. If the HTML refers to a file named images/logo.jpg, then the Web browser would request the image from www.mywebsite.com/my-great-product/p/images/logo.jpg and would come up with a “File not found.”

You would need to also rewrite the image locations or make the references absolute (like <img src="/images/logo.jpg"/>) or put a base href at the top of the <head> of the page (<base href="/product.php"/>). But if you do that, you would need to fully specify any internal links that begin with # or ? because they would now go to something like product.php#details.
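The way a browser resolves these references can be reproduced with urljoin from Python's standard library, which follows the standard URL resolution rules. The page URL is the example from this checklist; the domain and file names are the article's placeholders.

```python
from urllib.parse import urljoin

page = "http://www.mywebsite.com/my-great-product/p/123"

# A relative reference is resolved against the friendly URL's "directory".
print(urljoin(page, "images/logo.jpg"))

# An absolute path ignores that directory and starts from the document root.
print(urljoin(page, "/images/logo.jpg"))

# With a base href of /product.php, a bare fragment link resolves against the base.
base = urljoin(page, "/product.php")
print(urljoin(base, "#details"))
```

The first call yields the broken …/my-great-product/p/images/logo.jpg location, the second the intended /images/logo.jpg, and the third shows why fragment links need to be fully specified once a base href is set.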

5. Change Your URLs

Now find all references to your old URLs, and replace them with your new URLs, using a function such as GenerateUrl to consistently create the new URLs. This is the only step that might require looking deep into the underlying code of your website.

6. Automatically Redirect Your Old URLs

Now that the URL rewriting is in place, you probably want Google to forget about your old URLs and start using the new ones. That is, when a search result brings up product.php?id=123, you’d want the user to be visibly redirected to my-great-product/p/123, which would then be internally rewritten back to product.php?id=123.

This is the reverse of what your URL rewriting already does. In fact, you could add another rule to .htaccess to achieve this, but if you get the rules in the wrong order, then the browser would go into a redirect loop.
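One way such a loop can arise is sketched below as a toy model in Python. It assumes that the server applies the visible redirect again to the URL produced by the internal rewrite; the simulate function and its only_original_request switch are inventions for this illustration, not Apache features.

```python
def simulate(url, redirects, rewrites, only_original_request, max_requests=5):
    """Return (URLs the browser requested, script finally served); raise on a loop."""
    requested = []
    while True:
        if len(requested) >= max_requests:
            raise RuntimeError("redirect loop: " + " -> ".join(requested))
        requested.append(url)
        internal = rewrites.get(url)          # friendly URL -> real script
        current = internal if internal is not None else url
        was_rewritten = internal is not None
        if current in redirects and not (only_original_request and was_rewritten):
            url = redirects[current]          # visible redirect: the browser asks again
            continue
        return requested, current

redirects = {"/product.php?id=123": "/my-great-product/p/123"}
rewrites = {"/my-great-product/p/123": "/product.php?id=123"}

print(simulate("/product.php?id=123", redirects, rewrites, only_original_request=True))
```

With only_original_request set to False, the same call raises an error, because the rewritten URL keeps being redirected back to the friendly one. Whatever mechanism is used in practice, the visible redirect must apply only to what the browser originally asked for.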

Another approach is to do the first redirect in PHP, using something like the CheckUrl function above. This has the added advantage that if you rename the product, the old URL will immediately become invalid and redirect to the newest one.

7. Update and Resubmit Your Site Map

Make sure to carry through your new URLs to your site map, your product feeds and everywhere else they appear.

Conclusion

URL rewriting is a relatively quick and easy way to improve your website’s appeal to customers and search engines. We’ve tried to explain some real examples of URL rewriting and to provide the technical details for implementing it on your own website.

(al)

in http://coding.smashingmagazine.com/2011/11/02/introduction-to-url-rewriting/

Strace – The Sysadmin Microscope

Sometimes as a sysadmin the logfiles just don’t cut it, and to solve a problem you need to know what’s really going on. That’s when I turn to strace — the system-call tracer.

A system call, or syscall, is where a program crosses the boundary between user code and the kernel. Fortunately for us using strace, that boundary is where almost everything interesting happens in a typical program.

The two basic jobs of a modern operating system are abstraction and multiplexing. Abstraction means, for example, that when your program wants to read and write to disk it doesn’t need to speak the SATA protocol, or SCSI, or IDE, or USB Mass Storage, or NFS. It speaks in a single, common vocabulary of directories and files, and the operating system translates that abstract vocabulary into whatever has to be done with the actual underlying hardware you have. Multiplexing means that your programs and mine each get fair access to the hardware, and don’t have the ability to step on each other — which means your program can’t be permitted to skip the kernel, and speak raw SATA or SCSI to the actual hardware, even if it wanted to.

So for almost everything a program wants to do, it needs to talk to the kernel. Want to read or write a file? Make the open() syscall, and then the syscalls read() or write(). Talk on the network? You need the syscalls socket(), connect(), and again read() and write(). Make more processes? First clone() (inside the standard C library function fork()), then you probably want execve() so the new process runs its own program, and you probably want to interact with that process somehow, with one of wait4(), kill(), pipe(), and a host of others. Even looking at the clock requires a system call, clock_gettime(). Every one of those system calls will show up when we apply strace to the program.
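For something small to experiment with, here is a short Python program that does nothing but create, write and read back a file. It is only a sketch: the file name demo.txt is arbitrary, and the exact syscall names that appear (for instance open versus openat) depend on the system and C library.

```python
import os
import tempfile

def touch_and_read(directory):
    """Create a file, write to it, and read it back using thin wrappers around syscalls."""
    path = os.path.join(directory, "demo.txt")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.write(fd, b"hello\n")
    os.close(fd)
    fd = os.open(path, os.O_RDONLY)
    data = os.read(fd, 100)
    os.close(fd)
    return data

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(touch_and_read(d))
```

Running a script like this under strace should show the file being opened, written and read, buried among the many calls the Python interpreter makes while starting up.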

In fact, just about the only thing a process can do without making a telltale system call is pure computation: using the CPU and RAM and nothing else. As a former algorithms person, that’s what I used to think was the fun part. Fortunately for us as sysadmins, very few real-life programs spend very long in that pure realm before having to deal with a file or the network or some other part of the system, and then strace picks them up again.

Let’s look at a quick example of how strace solves problems.

Use #1: Understand A Complex Program’s Actual Behavior

One day, I wanted to know which Git commands take out a certain lock — I had a script running a series of different Git commands, and it was failing sometimes when run concurrently because two commands tried to hold the lock at the same time.

Now, I love sourcediving, and I’ve done some Git hacking, so I spent some time with the source tree investigating this question. But this code is complex enough that I was still left with some uncertainty. So I decided to get a plain, ground-truth answer to the question: if I run “git diff”, will it grab this lock?

Strace to the rescue. The lock is on a file called index.lock. Anything trying to touch the file will show up in strace. So we can just trace a command the whole way through and use grep to see if index.lock is mentioned:

$ strace git status 2>&1 >/dev/null | grep index.lock
open(".git/index.lock", O_RDWR|O_CREAT|O_EXCL, 0666) = 3
rename(".git/index.lock", ".git/index") = 0

$ strace git diff 2>&1 >/dev/null | grep index.lock
$

So git status takes the lock, and git diff doesn’t.

Interlude: The Toolbox

To help make it useful for so many purposes, strace takes a variety of options to add or cut out different kinds of detail and help you see exactly what’s going on.

In Medias Res, If You Want

Sometimes we don’t have the luxury of starting a program over to run it under strace — it’s running, it’s misbehaving, and we need to find out what’s going on. Fortunately strace handles this case with ease. Instead of specifying a command line for strace to execute and trace, just pass -p PID where PID is the process ID of the process in question — I find pstree -p invaluable for identifying this — and strace will attach to that program, while it’s running, and start telling you all about it.

Times

When I use strace, I almost always pass the -tt option. This tells me when each syscall happened — -t prints it to the second, -tt to the microsecond. For system administration problems, this often helps a lot in correlating the trace with other logs, or in seeing where a program is spending too much time.

For performance issues, the -T option comes in handy too — it tells me how long each individual syscall took from start to finish.

Data

By default strace already prints the strings that the program passes to and from the system — filenames, data read and written, and so on. To keep the output readable, it cuts off the strings at 32 characters. You can see more with the -s option — -s 1024 makes strace print up to 1024 characters for each string — or cut out the strings entirely with -s 0.

Sometimes you want to see the full data flowing in just a few directions, without cluttering your trace with other flows of data. Here the options -e read= and -e write= come in handy.

For example, say you have a program talking to a database server, and you want to see the SQL queries, but not the voluminous data that comes back. The queries and responses go via write() and read() syscalls on a network socket to the database. First, take a preliminary look at the trace to see those syscalls in action:

$ strace -p 9026
Process 9026 attached - interrupt to quit
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
[...]

Those write() syscalls are the SQL queries — we can make out the SELECT foo FROM bar, and then it trails off. To see the rest, note the file descriptor the syscalls are happening on — the first argument of read() or write(), which is 3 here. Pass that file descriptor to -e write=:

$ strace -p 9026 -e write=3
Process 9026 attached - interrupt to quit
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
 | 00000  30 00 00 00 03 53 45 4c  45 43 54 20 74 69 6d 65  0....SEL ECT time |
 | 00010  73 74 61 6d 70 20 46 52  4f 4d 20 61 72 74 69 66  stamp FR OM artif |
 | 00020  61 63 74 73 20 57 48 45  52 45 20 69 64 20 3d 20  acts WHE RE id =  |
 | 00030  31 34 35 34                                       1454              |

and we see the whole query. It’s both printed and in hex, in case it’s binary. We could also get the whole thing with an option like -s 1024, but then we’d see all the data coming back via read() — the use of -e write= lets us pick and choose.
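Spotting the right file descriptor by eye works for a short trace. For a longer one, a small filter can count the read() and write() calls per descriptor. This is only a sketch: fd_summary is a made-up name, and it assumes a plain trace saved to a file, with no -f "[pid N]" prefixes and no timestamps in front of the syscall name.

```shell
# fd_summary
# Hypothetical helper (not part of strace): reads plain strace output on
# stdin and counts read()/write() calls per file descriptor.
# Assumes each line starts with the syscall name.
fd_summary() {
    awk '
        match($0, /^(read|write)\([0-9]+/) {
            paren = index($0, "(")
            name  = substr($0, 1, paren - 1)              # "read" or "write"
            fd    = substr($0, paren + 1, RLENGTH - paren) # digits after "("
            count[name " fd=" fd]++
        }
        END { for (key in count) print count[key], key }
    ' | sort
}
```

Run against a saved preliminary trace (fd_summary < trace.txt), it shows which descriptors carry the traffic, and the busiest one is usually the one to hand to -e write= or -e read=.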

Filtering the Output

Sometimes the full syscall trace is too much — you just want to see what files the program touches, or when it reads and writes data, or some other subset. For this the -e trace= option was made. You can select a named suite of system calls like -e trace=file (for syscalls that mention filenames) or -e trace=desc (for read() and write() and friends, which mention file descriptors), or name individual system calls by hand. We’ll use this option in the next example.

Child Processes

Sometimes the process you trace doesn’t do the real work itself, but delegates it to child processes that it creates. Shell scripts and Make runs are notorious for taking this behavior to the extreme. If that’s the case, you may want to pass -f to make strace “follow forks” and trace child processes, too, as soon as they’re made.

For example, here’s a trace of a simple shell script, without -f:

$ strace -e trace=process,file,desc sh -c \
   'for d in .git/objects/*; do ls $d >/dev/null; done'
[...]
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4b68af5770) = 11948
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11948
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, 0x7fffc3473604, WNOHANG, NULL) = -1 ECHILD (No child processes)

Not much to see here — all the real work was done inside process 11948, the one created by that clone() syscall.

Here’s the same script traced with -f (and the trace edited for brevity):

$ strace -f -e trace=process,file,desc sh -c \
   'for d in .git/objects/*; do ls $d >/dev/null; done'                                                                                          
[...]
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
clone(Process 10738 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5a93f99770) = 10738
[pid 10682] wait4(-1, Process 10682 suspended

[pid 10738] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 10738] dup2(3, 1)                  = 1
[pid 10738] close(3)                    = 0
[pid 10738] execve("/bin/ls", ["ls", ".git/objects/28"], [/* 25 vars */]) = 0
[... setup of C standard library omitted ...]
[pid 10738] stat(".git/objects/28", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 10738] open(".git/objects/28", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[pid 10738] getdents(3, /* 40 entries */, 4096) = 2480
[pid 10738] getdents(3, /* 0 entries */, 4096) = 0
[pid 10738] close(3)                    = 0
[pid 10738] write(1, "04102fadac20da3550d381f444ccb5676"..., 1482) = 1482
[pid 10738] close(1)                    = 0
[pid 10738] close(2)                    = 0
[pid 10738] exit_group(0)               = ?
Process 10682 resumed
Process 10738 detached
<... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10738
--- SIGCHLD (Child exited) @ 0 (0) ---

Now this trace could be a miniature education in Unix in itself — future blog post? The key thing is that you can see ls do its work, with that open() call followed by getdents().

The output gets cluttered quickly when multiple processes are traced at once, so sometimes you want -ff, which makes strace write each process’s trace into a separate file.
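With -ff you normally also pass -o PREFIX, and strace then writes one file per process, named PREFIX.PID. As a sketch of how to find your way around those files, the helper below prints, for each one, the PID and the first execve() it recorded. The name execs_by_pid is invented for this example, and it assumes the trace was taken without timestamp options, so each line starts with the syscall name.

```shell
# execs_by_pid PREFIX
# Hypothetical helper (not part of strace): for each per-process file
# PREFIX.PID written by `strace -ff -o PREFIX`, print the PID and the
# first execve() line in that file, if it has one.
execs_by_pid() {
    for file in "$1".*; do
        [ -f "$file" ] || continue
        pid=${file##*.}                                   # text after the last dot
        line=$(sed -n '/^execve(/{p;q;}' "$file")         # first execve() only
        if [ -n "$line" ]; then
            printf '%s %s\n' "$pid" "$line"
        fi
    done
}
```

After something like strace -ff -o trace sh -c '...', running execs_by_pid trace gives a quick map from PID to program, which makes it easier to pick the one file worth reading.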

Use #2: Why/Where Is A Program Stuck?

Sometimes a program doesn’t seem to be doing anything. Most often, that means it’s blocked in some system call. Strace to the rescue.

$ strace -p 22067
Process 22067 attached - interrupt to quit
flock(3, LOCK_EX

Here it’s blocked trying to take out a lock, an exclusive lock (LOCK_EX) on the file it’s opened as file descriptor 3. What file is that?

$ readlink /proc/22067/fd/3
/tmp/foobar.lock

Aha, it’s the file /tmp/foobar.lock. And what process is holding that lock?

 $ lsof | grep /tmp/foobar.lock
 command   21856       price    3uW     REG 253,88       0 34443743 /tmp/foobar.lock
 command   22067       price    3u      REG 253,88       0 34443743 /tmp/foobar.lock

Process 21856 is holding the lock. Now we can go figure out why 21856 has been holding the lock for so long, whether 21856 and 22067 really need to grab the same lock, etc.
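The giveaway in that lsof output is the FD column: 3uW means file descriptor 3, open for reading and writing, with a write lock on the whole file (the trailing W), while plain 3u has no lock. A minimal sketch of picking out the holder automatically follows. The name lock_holders is made up, and it only looks for the w and W lock characters, so other lock types would be missed.

```shell
# lock_holders
# Hypothetical helper (not part of lsof): reads lsof output on stdin and
# prints the PID of every process whose FD column ends in a write-lock
# character (w = part of the file, W = the whole file).
lock_holders() {
    awk '$4 ~ /^[0-9]+[rwu][wW]$/ { print $2 }'
}
```

Used as lsof /tmp/foobar.lock | lock_holders, this would print 21856 for the listing above.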

Other common ways the program might be stuck, and how you can learn more after discovering them with strace:

  • Waiting on the network. Use lsof again to see the remote hostname and port.
  • Trying to read a directory. Don’t laugh — this can actually happen when you have a giant directory with many thousands of entries. And if the directory used to be giant and is now small again, on a traditional filesystem like ext3 it becomes a long list of “nothing to see here” entries, so a single syscall may spend minutes scanning the deleted entries before returning the list of survivors.
  • Not making syscalls at all. This means it’s doing some pure computation, perhaps a bunch of math. You’re outside of strace’s domain; good luck.

Uses #3, #4, …

A post of this length can only scratch the surface of what strace can do in a sysadmin’s toolbox. Some of my other favorites include

  • As a progress bar. When a program’s in the middle of a long task and you want to estimate if it’ll be another three hours or three days, strace can tell you what it’s doing right now — and a little cleverness can often tell you how far that places it in the overall task.
  • Measuring latency. There’s no better way to tell how long your application takes to talk to that remote server than watching it actually read() from the server, with strace -T as your stopwatch.
  • Identifying hot spots. Profilers are great, but they don’t always reflect the structure of your program. And have you ever tried to profile a shell script? Sometimes the best data comes from sending a strace -tt run to a file, and picking through to see when each phase of your program started and finished.
  • As a teaching and learning tool. The user/kernel boundary is where almost everything interesting happens in your system. So if you want to know more about how your system really works — how about curling up with a set of man pages and some output from strace?


Humble Frozen Synapse Bundle

Introducing the Humble Frozen Synapse Bundle!

Humble Bundle is back with another pay-what-you-want plus charity deal on sweet indie games — the Humble Frozen Synapse Bundle! Thanks to everyone’s past support, Humble Bundles have now raised over $2,000,000 for charity (the Electronic Frontier Foundation and Child’s Play charity).

This bundle features the exquisite, turn-based tactical strategy game Frozen Synapse, now available on Linux for the first time ever. We’re also launching with a bonus incentive: purchasers who beat the average price on the site will receive the Humble Frozenbyte Bundle, which includes Trine, Shadowgrounds, Shadowgrounds Survivor, the Jack Claw game prototype, and a preorder for Splot.

When you buy the bundle, you’ll not only get DRM-free copies of the games for Mac, Windows and Linux, but you’ll also get redemption keys for Steam and other platforms.

The promotion will only last 14 days though, so please tell your friends and check out the Humble Frozen Synapse Bundle!

Moving in

Hi!

I decided to buy a new domain, so here I am!
This website will never be about delivering news; its purpose is simply to gather and centralize useful information I need for my daily job, personal projects or even hobbies, just like the wiki. Some news may appear occasionally, but only for documentation purposes.

I hope you enjoy it.

Cheers,
Green Tuxer