Tuesday, July 14, 2009

RegEx Tokenizer

The following code snippet is adapted from Fredrik Lundh's effbot.org entry,
"Using Regular Expressions for Lexical Analysis".

Say you want to tokenize an expression such as "(3+5)*10":
#!/usr/bin/env python
'''
Use regex to tokenize a string expression.
adapted from:
http://effbot.org/zone/xml-scanner.htm
'''
import re

reg_token = re.compile(r"""
    \s*                   # skip leading whitespace
    ([0-9.]+|             # one or more digits or '.', i.e. floats or ints
    \w+|                  # words
    [+\-*/!^%&|]{1,2}|    # operators, one or two characters
    .)                    # any other character except newline
    """,
    re.VERBOSE)

def tokenize(expr):
    '''
    Returns a list of tokens for an expression string.
    Allows operators +-*/!^%&|
    Treats doubled operator e.g., **, ++ as single token
    '''
    def v_token(obj):
        # Convert numeric tokens to float or int; leave the rest as strings.
        try:
            if '.' in obj:
                return float(obj)
            else:
                return int(obj)
        except ValueError:
            return obj

    # group(1) is the token itself; group() would also include
    # the whitespace skipped in front of it.
    return [v_token(tkn.group(1)) for tkn
            in reg_token.finditer(expr)]


Let's test on some expressions:

expr = ["(3+7)*90",             # basic
        "(3+7.1)*90",           # has floats
        "(3+7.1)*90*alpha",     # has variables
        "(3+7.1)*90*alpha, g",  # invalid expression, tokenize and leave to parser
        "(5.0 - 3.2)/6*9",      # other forms
        "b = 2 + a*10",
        "x = \n x**2",          # picks up **, ++ as a token
        "i++",
        ""
        ]

for exp in expr:
    tkns = tokenize(exp)
    print("\nExpression: %s\nTokens: %s " % (exp, tkns))

Gives us (on Python 2.7 or later, where floats print in their shortest form)...

Expression: (3+7)*90
Tokens: ['(', 3, '+', 7, ')', '*', 90]

Expression: (3+7.1)*90
Tokens: ['(', 3, '+', 7.1, ')', '*', 90]

Expression: (3+7.1)*90*alpha
Tokens: ['(', 3, '+', 7.1, ')', '*', 90, '*', 'alpha']

Expression: (3+7.1)*90*alpha, g
Tokens: ['(', 3, '+', 7.1, ')', '*', 90, '*', 'alpha', ',', 'g']

Expression: (5.0 - 3.2)/6*9
Tokens: ['(', 5.0, '-', 3.2, ')', '/', 6, '*', 9]

Expression: b = 2 + a*10
Tokens: ['b', '=', 2, '+', 'a', '*', 10]

Expression: x =
 x**2
Tokens: ['x', '=', 'x', '**', 2]

Expression: i++
Tokens: ['i', '++']

Expression:
Tokens: []
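
One thing to watch with re.VERBOSE: a '#' comment runs only to the end of its own line. If a long comment wraps onto a second line, the wrapped words stop being a comment and their letters become part of the pattern (the spaces are dropped). Here is a minimal sketch of that effect. The names good and bad and the shortened patterns are illustrative only, not part of the snippet above.

```python
import re

# Comment stays on one line: the alternatives are numbers | words.
good = re.compile(r"""
    ([0-9.]+|   # digits or '.'
    \w+)        # words
    """, re.VERBOSE)

# Comment wraps: "aka numbers" is now pattern text, so the second
# alternative only matches words that begin with "akanumbers".
bad = re.compile(r"""
    ([0-9.]+|   # digits or '.'
    aka numbers
    \w+)        # words
    """, re.VERBOSE)

print(good.findall("alpha 42"))   # ['alpha', '42']
print(bad.findall("alpha 42"))    # ['42']
```

Keeping each comment on the same line as the piece of pattern it describes avoids the problem.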

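The pattern matches leading whitespace with \s* outside the capturing group, so group() (the whole match) and group(1) (just the captured token) differ whenever a token has a space in front of it. A small sketch of the difference, using pat as a simplified stand-in for the tokenizer pattern:

```python
import re

# Simplified stand-in: skip whitespace, then capture a word or any single character.
pat = re.compile(r"\s*(\w+|.)")

whole = [m.group() for m in pat.finditer("b = 2")]    # entire match
token = [m.group(1) for m in pat.finditer("b = 2")]   # captured group only

print(whole)   # ['b', ' =', ' 2']
print(token)   # ['b', '=', '2']
```

For a tokenizer, group(1) is usually what you want, since a parser comparing a token against '=' will not match ' ='.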

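Note that {1,2} on the operator class accepts any two operator characters in a row, not only a doubled one, so "3*-2" produces a single '*-' token. The sketch below shows this, together with one possible stricter variant that uses a backreference so that only a repeated character is merged. The names ops and strict_ops are hypothetical, and the stricter pattern is an assumption about what you might want, not something from the original snippet.

```python
import re

# Operator part of the tokenizer pattern on its own.
ops = re.compile(r"[+\-*/!^%&|]{1,2}")

print(ops.findall("x**2"))    # ['**']
print(ops.findall("3*-2"))    # ['*-']
print(ops.findall("i+++j"))   # ['++', '+']

# Hypothetical stricter variant: one operator, optionally followed
# by the same character again (\1 refers back to the first group).
strict_ops = re.compile(r"([+\-*/!^%&|])\1?")

strict = [m.group() for m in strict_ops.finditer("3*-2")]
print(strict)                 # ['*', '-']
```

Whether '*-' should be one token or two depends on the grammar the parser expects, so treat the stricter form as an option rather than a fix.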
The code snippet is on DZone: http://snippets.dzone.com/user/bondgeek


