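Anyone who has worked with user-submitted survey data knows how messy it can be. Getting a list of uniformly formatted strings that is ready for analysis takes several steps: stripping whitespace, removing punctuation, and standardizing capitalization. One way to do this is with the built-in string methods and the re standard library module for regular expressions: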

Typical approach

import re

states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda',
          'south carolina##', 'West virginia?']

def clean_strings(strings):
    # Typical processing steps, applied to each value in turn
    result = []
    for value in strings:
        value = value.strip()                # trim leading/trailing whitespace
        value = re.sub('[!#?]', '', value)   # remove the punctuation characters ! # ?
        value = value.title()                # capitalize the first letter of each word
        result.append(value)
    return result

In [173]: clean_strings(states)
Out[173]:
['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South Carolina',
 'West Virginia']

Recommended approach

def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

# Functions are objects too, so they can be stored in a list
clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        # Apply each operation in order to the current value
        for function in ops:
            value = function(value)
        result.append(value)
    return result

In [175]: clean_strings(states, clean_ops)
Out[175]:
['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South Carolina',
 'West Virginia']

# Or use map to apply a single function to every element.
# Only punctuation is removed here, so whitespace and case are left as they were.
In [176]: for x in map(remove_punctuation, states):
   .....:     print(x)
 Alabama 
Georgia
Georgia
georgia
FlOrIda
south carolina
West virginia
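Because the cleaning steps live in an ordinary list, changing the pipeline means editing that list, not the loop. The sketch below illustrates this under a few assumptions: `collapse_spaces` and `apply_ops` are hypothetical helpers that do not appear in the original text, the sample data has an extra space added to one entry to give the new step something to do, and `functools.reduce` is just one possible way to chain the functions.

```python
import re
from functools import reduce

def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

def collapse_spaces(value):
    # Hypothetical extra step: squeeze runs of whitespace into a single space
    return re.sub(r'\s+', ' ', value)

def apply_ops(value, ops):
    # Hypothetical helper: pass the value through each function in order
    return reduce(lambda acc, func: func(acc), ops, value)

# Adding a step only requires inserting it at the right position in the list
clean_ops = [str.strip, remove_punctuation, collapse_spaces, str.title]

# Same sample data as above, with a double space added to 'south  carolina##'
states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda',
          'south  carolina##', 'West virginia?']

cleaned = [apply_ops(s, clean_ops) for s in states]
print(cleaned)
```

The order of the list matters: here `str.title` runs last so that capitalization is applied to the already trimmed and de-punctuated text.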