Ga direct naar


How do I write my own parser? (for JSON)

Friday 05 December 2008 09:00

If no parser is available for the file you need, writing one yourself may be easier than you think. What file-structures are managable? What would be the design of such a parser? How do you make sure it is complete? Here we describe the process for building a JSON parser in C#, and issue the source code.

By Patrick van Bergen

[Download the JSON parser / generator for C#]

The software is subject to the MIT license: you are free to use it in any way you like, but it must keep its license.

For our synchronisation-module (which we use to synchronize data between diverse business applications) we chose JSON for data exchange. JSON is just a little better suited for a PHP web-environment than XML, because:

  • The PHP functions json_encode() and json_decode() allow you to convert data structures from and to JSON strings
  • JSON can be sent directly to the browser in an Ajax request
  • It takes up less space than XML, which is important in server > browser traffic.
  • A JSON string can be composed of only ASCII characters, while still being able to express all UNICODE characters, thus avoiding all possible conversion issues a transport may carry.

So JSON is very convenient for PHP. But of course we wanted to be able to synchronize with Windows applications as well, and because C# is better suited to this environment, this part of the module was written in this language. The .Net framework just didn't have its own JSON parser / encoder and the open-source software written for this task often contained a whole package of classes and constraints and sometimes the JSON implementation wasn't even complete.

We just wanted a single class that could be imported and that used the most basic building blocks of our application: the ArrayList and the Hashtable. Also, all aspects of JSON should would have to be implemented, there should a JSON generator, and of course it should be fast.

More reasons to write our own parser weren't necessary. Writing a parser happens to be a very thing satisfying to do. It is the best way to learn a new programming language thoroughly. Especially if you're using unit-testing to guarantee the parser / generator matches the language specification exactly. JSON's specification is easy to find. The website http://www.json.org/ is as clear as one could wish for.

You start by writing the unit-tests. You should really write all test before starting the implementation, but such patience is seldomly found in a programmer. You can at least start by writing some obvious tests that help you to create a consistent API. This is an example of a simple object-test:

string json;
Hashtable o;
bool success = true;

json = "{\"name\":123,\"name2\":-456e8}";
o = (Hashtable)JSON.JsonDecode(json);
success = success && ((double)o["name"] == 123);
success = success && ((double)o["name2"] == -456e8);

Eventually you should write all tests needed to check all aspects of the language, because your users (other programmers) will assume that the parser just works.

OK. Parsers. Parsers are associated with specialized software: so called compiler compilers (of which Yacc is the most well known). Using this software will make sure that the parser will be fast, but it does not do all the work for you. What's more, it can be even easier to write the entire parser yourself than to do all the preparatoy work for the cc.

The compiler compiler is needed for languages with a high level of ambiguity. A language expression is parsed from-left-to-right. If a language contains many structures that cannot be identified at the start of te parse, it is advisable to use a tool that is able to manage the emerging complexity.

Unambiguous languages are better suitable for building the parser manually, using recursive functions to process the recursive nature of the language. The parser looks ahead one or more tokens to identify the next construct. For JSON it is even sufficient to look ahead a single token. This classifies it as an LL(1) language (see also http://en.wikipedia.org/wiki/LL_parser).

A parser takes as input a string of tokens. Tokens are the most elementary building blocks of a language, like "+", "{", "[", but also complete numbers like "-1.345e5" and strings like "'The scottish highlander looked around.'". The parse-phase is usually preceded by a tokenization phase. In our JSON parser this step is integrated in the parser, because to determine the next token, in almost all cases, it is enough to just read the next character in the string. This saves the allocation of a token table in memory.

The parser takes a string as input and returns a C# datastructure, consisting of ArrayLists, Hashtables, a number of scalar value types and null. The string is processed from left-to-right. An index (pointer) keeps track of the current position in the string at any moment. At each level of the parse process the parser performs these steps:

  • Look ahead 1 token to determine the type of the next construct
  • Choose the function to parse the construct
  • Call this function and integrate the returned value in the construct that is currently built.

A nice example is the recursive function "ParseObject" that parses an object:

protected Hashtable ParseObject(char[] json, ref int index)
{
Hashtable table = new Hashtable();
int token;

// {
NextToken(json, ref index);

bool done = false;
while (!done) {
token = LookAhead(json, index);
if (token == JSON.TOKEN_NONE) {
return null;
} else if (token == JSON.TOKEN_COMMA) {
NextToken(json, ref index);
} else if (token == JSON.TOKEN_CURLY_CLOSE) {
NextToken(json, ref index);
return table;
} else {

// name
string name = ParseString(json, ref index);
if (name == null) {
return null;
}

// :
token = NextToken(json, ref index);
if (token != JSON.TOKEN_COLON) {
return null;
}

// value
bool success = true;
object value = ParseValue(json, ref index, ref success);
if (!success) {
return null;
}

table[name] = value;
}
}

return table;
}

The function is only called if a look ahead has determined that a construct starts with an opening curly brace. So this token may be skipped. Next, the string is parsed just as long as the closing brace is not found, or the end of the string is found (a syntax error, but one that needs to be caught). Between the braces there are a number of "'name': value" pairs, separated by comma's. This algorithm is can be found literally in the function, which makes it very insightful and thus easy to debug. The function builds an ArrayList and returns it to the calling function. The parser mainly consists of these types of functions.

If you create your own parser, you will always need to take into account that the incoming string may be grammatically incorrect. Users expect the parser to be able to tell on which line the error occurred. Our parser only remembers the index, but it also contains an extra function that returns the immediate context of the position of the error, comparable to the error messages that MySQL generates.

If you want to know more about parsers, it is good to know there consists a een standard work on this subject, that recently (2006) saw its second version:

Compilers: principles, techniques, and tools, Aho, A.V., Sethi, R. and Ullman ,J.D. (1986)

« Back

Reactions on "How do I write my own parser? (for JSON)"

1 2 3 4 5 6 Last page
bhagwat sharma
Placed on: 03-17-2010 06:55
Quote
Friturist wrote:
Has this implementation been open sourced? It sounds like the .NET world could use a working JSON parser.
Greg
Placed on: 04-03-2010 11:17
Damn fine code.

I'm using it for persistant hashtable storage. I JSON encode my hash in c# to a string and save as a text file, and then JSON decode the text file file and convert it back to a hash when opening the program. On top of that, it allowed me to update the persistant file via a master hash on the web. No database. Really great!!!

No Donate button? I'll favorite you and come back to throw some amazon gift list stuff at you when you put up a link!!!!

Thanks for some fine work!
garfix
Placed on: 04-03-2010 19:48
Patrick van Bergen
User icon
to be continuum
You're welcome! No need to spend your money on me. You just made my day with the very generous compliment. Thank you! Smile
garfix
Placed on: 04-27-2010 14:57
Patrick van Bergen
User icon
to be continuum
UPDATE:

I made a change to the function "ParseString": it now uses a temporary StringBuilder for the parsed string. This is an important speed improvement when parsing large string values.
Edited
garfix has edited this message on: 04-27-2010 14:57
Rashid
Placed on: 04-27-2010 19:43
Again, this is awesome code. I'm trying to iterate through a hashtable, within another hashtable, recursively.... Not having much luck, how would I do this geting all the keys and values for all the hashtables?


foreach (DictionaryEntry de in myHashtable)
{
Console.WriteLine("{0} \t: {1}", de.Key, de.Value);
.......
........
}

Thanks

Rashid
garfix
Placed on: 04-28-2010 09:09
Patrick van Bergen
User icon
to be continuum
I am not sure I understand your problem, but you may want to look at the function SerializeObject: it iterates through the elements of an hashtable (in order to serialize them). If you have nested hashtables, consider creating a recursive function to go though all their elements.
Edited
garfix has edited this message on: 04-28-2010 09:10
zeko77
Placed on: 04-28-2010 12:02
Sweet, sleek, elegant and it WORKS!!!
great
Andrew Gibson
Placed on: 05-17-2010 18:48
Awesome. Excellent job.
Andrew Gibson
Placed on: 05-19-2010 11:15
Just e-mailed you a little extension I wrote to your code (to info@procurios.nl)
Rwin
Placed on: 05-22-2010 04:14
Hi,

Great work! It looks like the code is not thread safe. So be careful using it in a multithreaded environment.

Error handling might not work as expected.

garfix
Placed on: 05-23-2010 00:06
Patrick van Bergen
User icon
to be continuum
Thanks, Andrew, the extension is quite useful!

Rwin, you are right about the code not being thread safe. I used a singleton for error handling purposes. I had not considered multi-threading environments when I wrote this.

As soon as time permits, I will include both your additions into a new version.
hallem
Placed on: 06-18-2010 18:14
Why does the class need an instance of itself as a static member?
garfix
Placed on: 06-21-2010 12:07
Patrick van Bergen
User icon
to be continuum
To allow you get debug information of the latest operation. I intend to change this to make the code thread safe.
mtighe
Placed on: 06-30-2010 20:24
FYI, on your IsNumeric function, it would be better to use Double.TryParse. Much better than throwing an exception.
rozgo
Placed on: 07-04-2010 10:05
Just found this. Beautifully simple. Exactly what I needed. Funny thing is right before finding this I was trying to make a 100+ file project work with Unity3D. I miss one file solutions like you have no idea.
garfix
Placed on: 07-07-2010 17:32
Patrick van Bergen
User icon
to be continuum
I made the following changes to the code:

1) Made the code thread-safe by removing the instance and the debug function. You can now ask if a decode succeeded by passing a second parameter to JsonDecode. (thanks Rwin, hallem)
2) Corrupted JSON string detection was not perfect. It is much better now.
3) Changed the IsNumeric function (thanks, mtighe)
4) Added the function JsonHydrate to support C# 4 ExpandoObjects. It is described in the article text (many thanks to Andrew Gibson!)
Jakub
Placed on: 07-20-2010 15:31
What is the purpose for "index++;" in line 311 and then "index--;" in line 332 ?
BananielTheSpaniel
Placed on: 07-26-2010 18:47
Hi, I tweaked the class so it is able to use indentation (aka beautification), while encoding. Additionally the order of objects, arrays and strings is preserved within the output-file. It won't be reversed anymore. I confess it is not much of a deal. Big Kudos to Patrick. With my coding skills I wouldn't have been able to accomplish code that slim and functional. I still don't have a web presence where to put the code (albeit there still had to be an ok from you), and your google-mail account is rejecting my mail (attachments?). So if theres a need for it give me a clue where to put the modifications. So long.
Saurabh Kumar
Placed on: 08-04-2010 12:40
What a great code. Thanks a lot for Post..:)
Slawek S
Placed on: 08-06-2010 17:03
Hi! It's great. I use it in my c# projects but ... what about java?

see below and use it if you like without any limitation. It's still Patric's code translated by me to java, not mine.


/**********/
package com.processcrm.util;

import java.util.ArrayList;
import java.util.List;
import java.util.Hashtable;
import java.util.Map;

public final class JSON {

public final int TOKEN_NONE = 0;
public final int TOKEN_CURLY_OPEN = 1;
public final int TOKEN_CURLY_CLOSE = 2;
public final int TOKEN_SQUARED_OPEN = 3;
public final int TOKEN_SQUARED_CLOSE = 4;
public final int TOKEN_COLON = 5;
public final int TOKEN_COMMA = 6;
public final int TOKEN_STRING = 7;
public final int TOKEN_NUMBER = 8;
public final int TOKEN_TRUE = 9;
public final int TOKEN_FALSE = 10;
public final int TOKEN_NULL = 11;

private final int BUILDER_CAPACITY = 2000;

public JSON() {};

// On decoding, this value holds the position at which the parse failed (-1 = no error).
private int lastErrorIndex = -1;
private String lastDecode = "";

/**
* Parses the String json into a value
* @param json JSON String
* @return List, Map, a double, a String, a Boolean or null
*/
public Object decode(String json) {
// save the String for debug information
this.lastDecode = json;
if(json != null) {
char[] charArray = json.toCharArray();
int[] index = new int[]{0};
boolean[] success = new boolean[]{true};
Object value = parseValue(charArray, index, success);
if(success[0]) {
this.lastErrorIndex = -1;
} else {
this.lastErrorIndex = index[0];
}
return value;
} else {
return null;
}
}

/**
* Converts a Map or List Object into a JSON String
* @param json Map or List Object
* @returns A JSON encoded String, or null if Object 'json' is not serializable
*/
public String encode(Object json) {
StringBuilder builder = new StringBuilder(BUILDER_CAPACITY);
boolean success = serializeValue(json, builder);
return (success ? builder.toString() : null);
}

private Object parseValue(char[] json, int[] index, boolean[] success) {
switch(lookAhead(json, index)) {
case TOKEN_STRING:
return parseString(json, index);
case TOKEN_NUMBER:
return parseNumber(json, index);
case TOKEN_CURLY_OPEN:
return parseObject(json, index);
case TOKEN_SQUARED_OPEN:
return parseArray(json, index);
case TOKEN_TRUE:
nextToken(json, index);
return true;
case TOKEN_FALSE:
nextToken(json, index);
return false;
case TOKEN_NULL:
nextToken(json, index);
return null;
case TOKEN_NONE:
break;
}
success[0] = false;
return null;
}

private Map<String,Object> parseObject(char[] json,int[] index) {
Map<String,Object> table = new Hashtable<String,Object>();
int token;
// {
nextToken(json, index);
boolean done = false;
while (!done) {
token = lookAhead(json, index);
if(token == TOKEN_NONE) {
return null;
} else if(token == TOKEN_COMMA) {
nextToken(json, index);
} else if(token == TOKEN_CURLY_CLOSE) {
nextToken(json, index);
return table;
} else {
// name
String name = parseString(json, index);
if(name == null) return null;
// :
token = nextToken(json, index);
if(token != TOKEN_COLON) return null;
// value
boolean[] success = new boolean[]{true};
Object value = parseValue(json, index, success);
if(!success[0]) return null;
table.put(name,value);
}
}
return table;
}

private List<Object> parseArray(char[] json, int[] index) {
List<Object> array = new ArrayList<Object>();
// [
nextToken(json, index);
boolean done = false;
while (!done) {
int token = lookAhead(json, index);
if(token == TOKEN_NONE) {
return null;
} else if(token == TOKEN_COMMA) {
nextToken(json, index);
} else if(token == TOKEN_SQUARED_CLOSE) {
nextToken(json, index);
break;
} else {
boolean[] success = new boolean[]{true};
Object value = parseValue(json, index, success);
if(!success[0]) {
return null;
}
array.add(value);
}
}
return array;
}

private String parseString(char[] json, int[] index) {
String s = "";
char c;
eatWhitespace(json, index);
// "
c = json[index[0]++];
boolean complete = false;
while (!complete) {
if(index[0] == json.length) break;
c = json[index[0]++];
if(c == '"') {
complete = true;
break;
} else if(c == '\\') {
if(index[0] == json.length) break;
c = json[index[0]++];
if(c == '"') {
s += '"';
} else if(c == '\\') {
s += '\\';
} else if(c == '/') {
s += '/';
} else if(c == 'b') {
s += '\b';
} else if(c == 'f') {
s += '\f';
} else if(c == 'n') {
s += '\n';
} else if(c == 'r') {
s += '\r';
} else if(c == 't') {
s += '\t';
} else if(c == 'u') {
int remainingLength = json.length - index[0];
if(remainingLength >= 4) {
// fetch the next 4 chars
char[] unicodeCharArray = new char[4];
for(int i=0;i<4;i++) unicodeCharArray=json[index[0]+i];
// parse the 32 bit hex into an integer codepoint
int codePoint = Integer.parseInt(new String(unicodeCharArray), 16);
// convert the integer codepoint to a unicode char and add to String
s += Character.toChars(codePoint);
// skip 4 chars
index[0] += 4;
} else {
break;
}
}

} else {
s += c;
}
}
if(!complete) s=null;
return s;
}

private double parseNumber(char[] json, int[] index) {
eatWhitespace(json, index);
int lastIndex = getLastIndexOfNumber(json, index);
int charLength = (lastIndex - index[0]) + 1;
char[] numberCharArray = new char[charLength];
for(int i=0;i<charLength;i++) numberCharArray=json[index[0]+i];
index[0] = lastIndex + 1;
return Double.parseDouble(new String(numberCharArray));
}

private int getLastIndexOfNumber(char[] json, int[] index) {
int lastIndex;
for (lastIndex = index[0]; lastIndex < json.length; lastIndex++) {
if("0123456789+-.eE".indexOf(json[lastIndex]) == -1) {
break;
}
}
return lastIndex - 1;
}

private void eatWhitespace(char[] json, int[] index) {
for (; index[0] < json.length; index[0]++) {
if(" \t\n\r".indexOf(json[index[0]]) == -1) {
break;
}
}
}

private int lookAhead(char[] json, int[] index) {
int[] saveIndex = new int[]{index[0]};
return nextToken(json, saveIndex);
}

private int nextToken(char[] json, int[] index) {
eatWhitespace(json, index);
int token=TOKEN_NONE;
while(true) {
if(index[0] == json.length) {
token=TOKEN_NONE;
break;
}
char c = json[index[0]++];
switch (c) {
case '{':
token=TOKEN_CURLY_OPEN;
break;
case '}':
token=TOKEN_CURLY_CLOSE;
break;
case '[':
token=TOKEN_SQUARED_OPEN;
break;
case ']':
token=TOKEN_SQUARED_CLOSE;
break;
case ',':
token=TOKEN_COMMA;
break;
case '"':
token=TOKEN_STRING;
break;
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
case '-':
token=TOKEN_NUMBER;
break;
case ':':
token=TOKEN_COLON;
break;
default:
token=TOKEN_NONE;
break;
}
if(token!=TOKEN_NONE) break;
index[0]--;
int remainingLength = json.length - index[0];
// false
if(remainingLength >= 5) {
if(json[index[0]] == 'f' &&
json[index[0] + 1] == 'a' &&
json[index[0] + 2] == 'l' &&
json[index[0] + 3] == 's' &&
json[index[0] + 4] == 'e') {
index[0] += 5;
token=TOKEN_FALSE;
break;
}
}
// true
if(remainingLength >= 4) {
if(json[index[0]] == 't' &&
json[index[0] + 1] == 'r' &&
json[index[0] + 2] == 'u' &&
json[index[0] + 3] == 'e') {
index[0] += 4;
token=TOKEN_TRUE;
break;
}
}
// null
if(remainingLength >= 4) {
if(json[index[0]] == 'n' &&
json[index[0] + 1] == 'u' &&
json[index[0] + 2] == 'l' &&
json[index[0] + 3] == 'l') {
index[0] += 4;
token=TOKEN_NULL;
break;
}
}

token=TOKEN_NONE;
break;
}
return token;
}

private boolean serializeObject(Map<String,Object> anObject, StringBuilder builder) {
builder.append("{");
boolean first = true;
for(String key: anObject.keySet()) {
Object value = anObject.get(key);
if(!first) {
builder.append(", ");
}
serializeString(key, builder);
builder.append(":");
if(!serializeValue(value, builder)) {
return false;
}
first=false;
}
builder.append("}");
return true;
}

private boolean serializeArray(List<Object> anArray, StringBuilder builder) {
builder.append("[");
boolean first = true;
for (int i = 0; i < anArray.size(); i++) {
Object value = anArray.get(i);
if(!first) {
builder.append(", ");
}
if(!serializeValue(value, builder)) {
return false;
}
first = false;
}
builder.append("]");
return true;
}

@SuppressWarnings("unchecked")
private boolean serializeValue(Object value, StringBuilder builder) {
if(value instanceof String) {
serializeString((String)value, builder);
} else if(value instanceof Map) {
serializeObject((Map<String,Object>)value, builder);
} else if(value instanceof List) {
serializeArray((List<Object>)value, builder);
} else if((value instanceof Number)) {
serializeNumber(new Double(((Number)value).toString()), builder);
} else if((value instanceof Boolean) && ((Boolean)value == true)) {
builder.append("true");
} else if((value instanceof Boolean) && ((Boolean)value == false)) {
builder.append("false");
} else if(value == null) {
builder.append("null");
} else {
return false;
}
return true;
}

private void serializeString(String aString, StringBuilder builder) {
builder.append("\"");
char[] charArray = aString.toCharArray();
for(int i = 0; i < charArray.length; i++) {
char c = charArray;
if(c == '"') {
builder.append("\\\"");
} else if(c == '\\') {
builder.append("\\\\");
} else if(c == '\b') {
builder.append("\\b");
} else if(c == '\f') {
builder.append("\\f");
} else if(c == '\n') {
builder.append("\\n");
} else if(c == '\r') {
builder.append("\\r");
} else if(c == '\t') {
builder.append("\\t");
} else {
int codepoint = new Integer(c);
if((codepoint >= 32) && (codepoint <= 126)) {
builder.append(c);
} else {
String u="0000"+Integer.toHexString(codepoint);
builder.append("\\u" + u.substring(u.length()-4));
}
}
}
builder.append("\"");
}

private void serializeNumber(Double number, StringBuilder builder) {
builder.append(number.toString());
}

/**
* On decoding, this function returns the position at which the parse failed (-1 = no error).
*
* @return
*/
public boolean lastDecodeSuccessful() {
return (lastErrorIndex == -1);
}

/**
* On decoding, this function returns the position at which the parse failed (-1 = no error).
* @return
*
*/
public int getLastErrorIndex() {
return lastErrorIndex;
}

/**
* If a decoding error occurred, this function returns a piece of the JSON String
* at which the error took place. To ease debugging.
* @return
*/
public String getLastErrorSnippet() {
if(this.lastErrorIndex == -1) {
return "";
} else {
int startIndex = lastErrorIndex - 5;
int endIndex = lastErrorIndex + 15;
if(startIndex < 0) {
startIndex = 0;
}
if(endIndex >= lastDecode.length()) {
endIndex = lastDecode.length() - 1;
}
return this.lastDecode.substring(startIndex);
}
}

@SuppressWarnings("unchecked")
public boolean serializeObjectOrArray(Object objectOrArray, StringBuilder builder) {
if(objectOrArray instanceof Map) {
return serializeObject((Map<String,Object>)objectOrArray, builder);
} else if(objectOrArray instanceof List) {
return serializeArray((List<Object>)objectOrArray, builder);
} else {
return false;
}
}

}
Chaz
Placed on: 08-07-2010 04:57
Thought I should point this out, the Char class has static members for checking whitespace, numbers, letters, and numbers and letters at once. I see you have a lot of char checking functions here and there, and thought you maximize it a bit.
garfix
Placed on: 08-09-2010 10:16
Patrick van Bergen
User icon
to be continuum
@Slawek S: Nice work! Just two remarks:
- Think about placing your code in a separate file and storing just the link in your post. It makes the code more readable (whitespace) and maintainable
- Would you be willing to update the code to the latest changes in the API of the C# class?

@Jakub: The index is returned to the calling function. It is increased when one of the 1-character tokens is found. If not, it needs to be decreased again for other tests of other tokens.

@BananielTheSpaniel: Please send your modifications inline in a mail to info@procurios.nl, c/o Patrick van Bergen. I want to look at them Smile I don't know why the attachment wouldn't pass.

@Chaz: I will look into that, thanks.
Vamsi Krishna
Placed on: 08-11-2010 13:08
wow, this looks promising, I' don't feel comfortable working with XML and linq, they both are ambiguous

Looking for an alternative since long time, hope this solves my problems

Thank and congratulations
Charlie
Placed on: 08-12-2010 17:41
Many thanks! I was pulling my hair out over Json.NET.
garfix
Placed on: 08-16-2010 12:02
Patrick van Bergen
User icon
to be continuum
@Chaz, I looked at it, but couldn't find a function that would be helpful. Perhaps an example?
1 2 3 4 5 6 Last page

Log in to comment on news articles.

Procurios zoekt PHP webdevelopers. Werk aan het Procurios Webplatform en klantprojecten! Zie http://www.slimmerwerkenbijprocurios.nl/.


Hello!

We are employees at Procurios, a full-service webdevelopment company located in the Netherlands. We are experts at building portals, websites, intranets and extranets, based on an in-house developed framework. You can find out more about Procurios and our products, might you be interested.

This weblog is built and maintained by us. We love to share our ideas, thoughts and interests with you through our weblog. If you want to contact us, please feel free to use the contact form!


Showcase

  • Klantcase: Bestseller
  • Klantcase: de ChristenUnie
  • Klantcase: Evangelische Omroep
  • Klantcase: de Keurslager
  • Klantcase: New York Pizza
  • Klantcase: Verhage

Snelkoppelingen