Numbers and strings#
This page covers a common problem when loading data into Pandas — when Pandas gets confused about whether values in a column are text or numbers.
An example#
import numpy as np
import pandas as pd
pd.set_option('mode.copy_on_write', True)
We return to the example data file that you may have seen in the text encoding page.
You can download the data file from imdblet_latin.csv.
films = pd.read_csv('data/imdblet_latin.csv', encoding='latin1')
films.head()
| Votes | Rating | Title | Year | Decade | |
|---|---|---|---|---|---|
| 0 | 635139 | 8.6 | Léon | 1994 | 1990 |
| 1 | 264285 | 8.1 | The Princess Bride | 1987 | 1980 |
| 2 | 43090 | N/K | Paris, Texas (1984) | 1984 | 1980 |
| 3 | 90434 | 8.3 | Rashômon | 1950 | 1950 |
| 4 | 427099 | 8.0 | X-Men: Days of Future Past | 2014 | 2010 |
Now imagine we are interested in the average rating across these films:
ratings = films['Rating']
ratings.mean()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[3], line 2
1 ratings = films['Rating']
----> 2 ratings.mean()
File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/series.py:6570, in Series.mean(self, axis, skipna, numeric_only, **kwargs)
6562 @doc(make_doc("mean", ndim=1))
6563 def mean(
6564 self,
(...) 6568 **kwargs,
6569 ):
-> 6570 return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/generic.py:12485, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
12478 def mean(
12479 self,
12480 axis: Axis | None = 0,
(...) 12483 **kwargs,
12484 ) -> Series | float:
> 12485 return self._stat_function(
12486 "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
12487 )
File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/generic.py:12442, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
12438 nv.validate_func(name, (), kwargs)
12440 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 12442 return self._reduce(
12443 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
12444 )
File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/series.py:6478, in Series._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
6473 # GH#47500 - change to TypeError to match other methods
6474 raise TypeError(
6475 f"Series.{name} does not allow {kwd_name}={numeric_only} "
6476 "with non-numeric dtypes."
6477 )
-> 6478 return op(delegate, skipna=skipna, **kwds)
File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
145 result = alt(values, axis=axis, skipna=skipna, **kwds)
146 else:
--> 147 result = alt(values, axis=axis, skipna=skipna, **kwds)
149 return result
File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
401 if datetimelike and mask is None:
402 mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
406 if datetimelike:
407 result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)
File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/nanops.py:720, in nanmean(values, axis, skipna, mask)
718 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
719 the_sum = values.sum(axis, dtype=dtype_sum)
--> 720 the_sum = _ensure_numeric(the_sum)
722 if axis is not None and getattr(the_sum, "ndim", False):
723 count = cast(np.ndarray, count)
File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/nanops.py:1701, in _ensure_numeric(x)
1698 elif not (is_float(x) or is_integer(x) or is_complex(x)):
1699 if isinstance(x, str):
1700 # GH#44008, GH#36703 avoid casting e.g. strings to numeric
-> 1701 raise TypeError(f"Could not convert string '{x}' to numeric")
1702 try:
1703 x = float(x)
TypeError: Could not convert string '8.68.1N/K8.38.08.18.08.48.28.08.08.08.68.68.08.28.48.48.18.38.48.28.58.08.28.18.48.18.68.48.18.78.1' to numeric
The problem#
The problem is that we were expecting our ratings to be numbers, but in fact, they are strings.
We can see what type of thing Pandas has stored by looking at the dtype
attribute of a Series, or the dtypes attribute of a data frame.
films.dtypes
Votes object
Rating object
Title object
Year int64
Decade int64
dtype: object
ratings.dtype
dtype('O')
In fact both these bits of information say the same thing – that the ‘Rating’ column stores things in the “object” or “O” type. This is a general type that can store any Python value. It is the standard type that Pandas uses when storing text.
Why does Pandas use text for the ‘Rating’ column?
A quick look at the first rows gives the answer:
ratings.head()
0 8.6
1 8.1
2 N/K
3 8.3
4 8.0
Name: Rating, dtype: object
The film “Paris, Texas (1984)” has a value “N/K” for the rating. This can’t be a number, so Pandas stored this column in a format that allows it to store “N/K” as text.
If that wasn’t obvious, another way of checking where the problem value is, to apply the function float to the column values.
When we apply a function to a Series, it does this:
For each value in the Series it:
Calls the function, with the value as the single argument.
Collects the new value returned from the function, and appends it to a new Series.
Returns the new Series.
The result is a Series that is the same length as the original series, but where each value in the new series is the result of calling the function on the original value.
Recall that the float function converts the thing you pass into a floating
point value:
v = float('3.14')
v
3.14
type(v)
float
Now we try applying float to the problematic column:
ratings.apply(float)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[9], line 1
----> 1 ratings.apply(float)
File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/series.py:4943, in Series.apply(self, func, convert_dtype, args, by_row, **kwargs)
4808 def apply(
4809 self,
4810 func: AggFuncType,
(...) 4815 **kwargs,
4816 ) -> DataFrame | Series:
4817 """
4818 Invoke function on values of Series.
4819
(...) 4934 dtype: float64
4935 """
4936 return SeriesApply(
4937 self,
4938 func,
4939 convert_dtype=convert_dtype,
4940 by_row=by_row,
4941 args=args,
4942 kwargs=kwargs,
-> 4943 ).apply()
File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/apply.py:1422, in SeriesApply.apply(self)
1419 return self.apply_compat()
1421 # self.func is Callable
-> 1422 return self.apply_standard()
File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/apply.py:1502, in SeriesApply.apply_standard(self)
1496 # row-wise access
1497 # apply doesn't have a `na_action` keyword and for backward compat reasons
1498 # we need to give `na_action="ignore"` for categorical data.
1499 # TODO: remove the `na_action="ignore"` when that default has been changed in
1500 # Categorical (GH51645).
1501 action = "ignore" if isinstance(obj.dtype, CategoricalDtype) else None
-> 1502 mapped = obj._map_values(
1503 mapper=curried, na_action=action, convert=self.convert_dtype
1504 )
1506 if len(mapped) and isinstance(mapped[0], ABCSeries):
1507 # GH#43986 Need to do list(mapped) in order to get treated as nested
1508 # See also GH#25959 regarding EA support
1509 return obj._constructor_expanddim(list(mapped), index=obj.index)
File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/base.py:925, in IndexOpsMixin._map_values(self, mapper, na_action, convert)
922 if isinstance(arr, ExtensionArray):
923 return arr.map(mapper, na_action=na_action)
--> 925 return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)
File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/algorithms.py:1743, in map_array(arr, mapper, na_action, convert)
1741 values = arr.astype(object, copy=False)
1742 if na_action is None:
-> 1743 return lib.map_infer(values, mapper, convert=convert)
1744 else:
1745 return lib.map_infer_mask(
1746 values, mapper, mask=isna(values).view(np.uint8), convert=convert
1747 )
File pandas/_libs/lib.pyx:2999, in pandas._libs.lib.map_infer()
ValueError: could not convert string to float: 'N/K'
One way of dealing with this problem is to make a recoding function.
A recoding function is a function that we will apply to a Series. That means that we call the function for every value in the Series. The function argument is the value from the series. The function returns the new value, for a new Series.
def recode_ratings(v):
if v == 'N/K': # Return missing value for 'N/K'
return np.nan
# Otherwise make text value into a float
return float(v)
We test our function:
recode_ratings('8.3')
8.3
recode_ratings('N/K')
nan
We make a new Series by calling the recode function:
new_ratings = ratings.apply(recode_ratings)
new_ratings.head()
0 8.6
1 8.1
2 NaN
3 8.3
4 8.0
Name: Rating, dtype: float64
We can insert this back into a copy of the original data frame:
films_fixed = films
films_fixed.loc[:, 'Rating'] = new_ratings
films_fixed.head()
| Votes | Rating | Title | Year | Decade | |
|---|---|---|---|---|---|
| 0 | 635139 | 8.6 | Léon | 1994 | 1990 |
| 1 | 264285 | 8.1 | The Princess Bride | 1987 | 1980 |
| 2 | 43090 | NaN | Paris, Texas (1984) | 1984 | 1980 |
| 3 | 90434 | 8.3 | Rashômon | 1950 | 1950 |
| 4 | 427099 | 8.0 | X-Men: Days of Future Past | 2014 | 2010 |