Wednesday, March 12, 2008

C#'s Query Syntax for LINQ

Take advantage of the new keywords associated with C# 3.0's query syntax. Learn how these keywords map to methods defined using the query operands, and how you can define your own custom implementation for the query keywords.


C# 3.0 includes many strong new features, but one of the most interesting is the inclusion of its new query keywords, that you use to create Query Expressions. Query Expressions are most commonly associated with Language Integrated Query (LINQ). They're the core syntax that you'll use when you create LINQ queries, whether you use LINQ to Objects, LINQ to SQL, LINQ to XML, or some other custom LINQ provider. I'll discuss the query keywords, how those keywords map to methods defined using the query operands, and how you can define your own custom implementation for the query keywords.

One of the design goals of the C# language (and most computer programming languages) is to keep the set of keywords small. Fewer keywords means there's more vocabulary available for your types, members, and other user-defined identifiers. It also simplifies the job of understanding the language. The C# team worked to minimize the necessary grammar for LINQ and the associated query syntax. They ended up adding eight contextual keywords to the C# language specifically for query: from, where, select, group, into, orderby, join, and let.

The C# team also saw type inference as a major goal of the new release. The C# language is still strongly typed: every identifier has a known type, at least to the compiler. However, a developer must declare the type explicitly only when the compiler cannot make its own determination of the variable's type. In many cases, the compiler determines the proper type for you.

Every query expression begins with the "from" keyword. From introduces a query expression by declaring the datasource and range variable. This snippet defines a simple query:

int[] someNumbers = { 2, 3, 7, 11, 13, 17, 19 };
var query = from number in someNumbers

someNumbers is the datasource, and number is the range variable. The datasource must implement IEnumerable, or IEnumerable. In this example, someNumbers implements IEnumerable. The compiler gives the range variable a type to match the type enumerated by the datasource. In the example above, number will be an integer. In general, the range variable is of type T, whenever the datasource implements IEnumerable. In the case where the range variable implements IEnumerable, the range variable is a System.Object. You never need to declare the type of the range variable; the compiler infers it.

Do More with "From"
The "from" clause can do much more than iterate a single sequence. Often, an outer sequence will contain an inner sequence that contains the data you're interested in. You can nest from clauses to create the sequence of elements from the inner sequences . Using multiple from clauses when the second datasource isn't a member of the first range variable will do a cross-join. This simple expression produces 25 pairs of values, ranging from {1,2} through {9,2} on to {1,10}, and, finally, through {9,10}:

int[] evenNumbers = { 2, 4, 6, 8, 10 };
int[] odddNumbers = { 1, 3, 5, 7, 9 };

var pairs = from even in evenNumbers
from odd in odddNumbers
select new { even, odd };

foreach (var pair in pairs)
Console.WriteLine(
" odd = {0}, even = {1}",
pair.odd, pair.even);

This from clause generates every permutation of a value from the first sequence value, and then pairs it with a value from the second sequence. Of course, you're not limited to two sources. You could do an N-way cross-join for any number N. Just remember that the size of the result set will grow geometrically as you add more source sequences.

The next new keyword is select. Select determines what type of object or objects are returned by your query. Select is powerful enough to create whatever types you want, including anonymous types, as the output of a query. Select projects the values you want into the output of a query. You've already seen two examples of the select statement. Listing 1 showed a simple select that returns the same type as the input query. The immediately preceding code snippet illustrating the cross-join shows an example of a select that transforms the input type into another type. In that case, it creates a new type containing a number from the first sequence and a number from the second sequence. The compiler does quite a bit of work for you in that example. The compiler infers the shape of the items in the output sequence, and it creates an anonymous type matching that description. The output sequence contains values of that anonymous type. It sounds more difficult than it really is. In that example, the compiler creates a type that contains two integers: odd and even. Those integers are accessed through public read-only properties. The compiler writes a type similar to the type you would have created by hand.

In general, you can use the select keyword to transform the values in the query into any type you decide you want for the output from the query. You'll see this feature at work in several of this article's examples. For instance, you'll see that you can compute values as part of the output, combine multiple inputs into a single output stream, generate new types, or almost anything else. Select is quite powerful because its argument is an arbitrary lambda expression that can return anything computed from the input sequence.

Filter the Input Sequence
Another keyword you'll use frequently is where. Where filters the input sequence and passes only those values that match the condition specified by the where predicate. You can place where clauses almost anywhere in your query expression. The only restrictions are that it must not be first, nor can it be last. For example, assume you want to filter the input sequence so it returns only those values greater than five:

var query = from number in someNumbers
where number > 5
select number;

Where clauses can also be more complicated. For example, you can combine multiple conditions into a single where clause:

var query = from number in someNumbers
where number > 5 && number * number

An alternative approach is to express the same construct with multiple where clauses:

var query = from number in someNumbers
where number > 5
where number * number

Where works the same way even when there are multiple sources, and you want to provide filters on those elements. All these query expressions produce the same result:

var pairs1 = from even in evenNumbers
where even > 4
from odd in odddNumbers
where odd < pairs2 =" from"> 4
where odd < pairs2 =" from"> 4 && odd

In general, you should consider placing where clauses as early as possible in your queries. Later clauses in your query operate on a smaller set of data.

Another new query keyword is orderby. Orderby creates an ordering of the output sequence. You can order the output elements on any expression that's part of the input sequence, even if that element isn't part of the output sequence, or by an element that's computed from part of the input sequence.

This query sorts the list of names by the length of the name and then by alphabetical order:

var folks = from person in myPeeps
orderby person.Last.Length + person.First.Length
descending,
person.Last,
person.First
select string.Format("{0}, {1}",
person.Last, person.First);

The first orderby arranges the sequence of elements from longest to shortest. In the case of ties, the last name will be used to order the people. If there are still any ties, the first name will be used. Note that I've also transformed the person object into a string in the select statement.

As I mentioned earlier, the select statement can create any output type you wish. In most cases, I like to put the orderby clauses as late in the query expression as possible. Orderby needs the full sequence to do its work, so it creates a natural bottleneck on the sequence processing.

Subdivide the Results
Occasionally, you'll use the group clause to subdivide the results of the query into logical groupings. You could modify the previous query by grouping the names by the length of the first name:

var folks = from person in myPeeps
group person by person.First.Length into names
orderby names.Key
select names;

There are a number of new ideas here, so let's look at them closely. This group expression creates subdivisions based on the length of the first name of each person. It's common to want to perform further processing on each group. Therefore, you'll often see an "into" following the group clause. The into keyword assigns a name to each grouping. In the above example, that grouping is called "names." Next, you see the familiar orderby clause to order the groups based on the key assigned to each group. The Key is the value of the "by" portion of the group clause. In this example, the Key is an integer, which stores the length of the first names in the group.

The result sequence of the query uses a different type when you add the group clause. The results for a group clause are stored in an IEnumerable> sequence. That's a sequence of groups. For each element, the Key element stores the key, and the IGrouping implements IEnumerable so that you can iterate the values in each group.

You use the join clause to perform equijoins. An equijoin creates a single sequence from two different input sequences based on an equality test of some property in both sequences. There are three different variations of join. Join is most easily illustrated in the context of LINQ to SQL, so these examples are taken from the Sample Queries sample application that ships with Visual Studio 2008. They perform joins on different tables in the Northwind database.

This query performs an inner join between the categories and products tables to produce a single sequence with the products and categories listed:

var q =
from c in db.Categories
join p in db.Products on c.ID equals
p.CategoryID
select new
{productName = p.ProductName,
categoryName = c.Name};

This code snippet creates a single sequence that contains the product name and category name. If a category doesn't have any matching products, that category doesn't appear in the output.

If you use the into keyword with your join clause, you create a group join. Similar to the groupby clause, a group join creates a list of lists for each group. In that same vein, you could create a list of all categories where each element contains the list of all products in that category:

var q =
from c in db.Categories
join p in db.Products on c.ID equals
p.CategoryID into productGroup
select new
{categoryName = c.Name,
Products =productGroup};

The only noticeable difference is the addition of the into keyword to create a set of groups. However, when you use the group join, any of the elements on the left side source won't appear in the output sequence unless there are matching elements in the right input sequence. In the preceding example, that means any categories with no products won't appear in the output sequence. To fix that, you need the left outer join. That requires using the DefaultIfEmpty method. This variation returns all categories, even those with no products:

var q =
from c in db.Categories
join p in db.Products on c.ID equals
p.CategoryID into productGroup
from aproduct in
productGroup.DefaultIfEmpty()
select new
{categoryName = c.Name,
Product =aproduct};

The left outer join produces a single, flat sequence, like the inner join. The element will contain the category and the product. In the approach used here, the product will be null if there are no products associated with a given category.

Only one keyword remains--let. Let creates a new range variable in the query expression. You can use this for a couple different purposes. First, if the new range variable is an enumerable type, it can be queried. Alternatively, it can store the result of some computation to avoid recomputing it.

Let's look at this keyword in action.

Suppose you want to modify the pair of numbers sample to keep only those pairs of numbers where the distance from the origin is less than 10. You could modify the query this way:

var pairs = from even in evenNumbers
from odd in odddNumbers
where Math.Sqrt(
even * even + odd * odd)

That's fine, but now assume you also want to include the distance from the origin in the output sequence. You could write it this way, but you'd be computing the distance twice for every value:

var pairs = from even in evenNumbers
from odd in odddNumbers
where Math.Sqrt(even * even + odd * odd) < distance =" Math.Sqrt(even">

This is a great place to use the let clause. You can rewrite the clause this way so that you have to compute the distance only once:

var pairs = from even in evenNumbers
from odd in odddNumbers
let distance = Math.Sqrt(even * even + odd * odd)
where distance

The second use of let is less common. If your interim calculation produces a sequence, you could use that range variable in a from clause and process each element in the sequence.

Map Keywords to Methods
That's all the new queries that support query expressions. Most of these new keywords compile to method calls. Where compiles to a call to the Where() method. Select compiles to a call to the Select() method. The first orderby compiles to a call to the OrderBy() method. Subsequent calls compile to ThenBy(). If the orderby clauses include the descending keyword, the methods called are OrderByDescending() and ThenByDescending(). Group clauses translate into the GroupBy() method calls. Join usually compiles to a call to Join(); if it's followed by "into," it compiles to a GroupJoin() method.

The .NET 3.5 base class library contains versions of each of these methods in two different classes: System.Linq.Queryable and System.Linq.Enumerable. The versions in Enumerable are defined so that the standard query operators can be used in any sequence. The Queryable version is used when the source supports IQueryable.

This might not sound so important, but there's a good reason for you to understand that these keywords map to specific methods: They're methods, so you can define your own version for your own types if you can create a better version than the default. The default versions are defined as extension methods, which means that any version you write will probably be a better match.

The standard query operators are general methods. Each of them takes expression trees, so you're signing up for quite a bit of work when you define your own versions of these. You'll more likely define your own versions of other extension methods defined in Queryable or Enumerable. Obvious examples are Take and Skip. You might use Skip to implement any container implementing IList:

public static IEnumerable
Skip(IList sequence,
int count)
{
for (int i = count;
i <>

Note that you can avoid any number of MoveNext calls by using the indexing operator.

Similarly, you could add your own versions of any of the methods defined in System.Linq.Enumerable, where you had a more efficient way of computing the answer.

Earlier in this article, I mentioned that these new keywords are contextual keywords. That's important for backward compatibility with existing programs. These new keywords are only reserved words when they appear in the proper location. These variable declarations are all legal C# 3.0:

int where = 5;
string select = "This is a string";
int var = 7;

That's because the keywords where, select, and var aren't appearing in the proper context for the compiler to view them in their special new way.

This might also seem like language geek trivia, but it's important for most developers. The C# team worked hard to ensure that these new additional features wouldn't break existing programs. By making these new keywords contextual keywords, they were able to introduce new syntax, while minimizing the chance of breaking any existing C# code. None of this new syntax will break your existing code.

No comments: