Numbers and strings

Numbers and strings#

This page covers a common problem when loading data into Pandas — when Pandas gets confused about whether values in a column are text or numbers.

An example#

import numpy as np
import pandas as pd
pd.set_option('mode.copy_on_write', True)

We return to the example data file that you may have seen in the text encoding page.

You can download the data file from imdblet_latin.csv.

films = pd.read_csv('data/imdblet_latin.csv', encoding='latin1')
films.head()

	Votes	Rating	Title	Year	Decade
0	635139	8.6	Léon	1994	1990
1	264285	8.1	The Princess Bride	1987	1980
2	43090	N/K	Paris, Texas (1984)	1984	1980
3	90434	8.3	Rashômon	1950	1950
4	427099	8.0	X-Men: Days of Future Past	2014	2010

Now imagine we are interested in the average rating across these films:

ratings = films['Rating']
ratings.mean()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 2
      1 ratings = films['Rating']
----> 2 ratings.mean()

File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/series.py:6570, in Series.mean(self, axis, skipna, numeric_only, **kwargs)
   6562 @doc(make_doc("mean", ndim=1))
   6563 def mean(
   6564     self,
   (...)   6568     **kwargs,
   6569 ):
-> 6570     return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)

File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/generic.py:12485, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  12478 def mean(
  12479     self,
  12480     axis: Axis | None = 0,
   (...)  12483     **kwargs,
  12484 ) -> Series | float:
> 12485     return self._stat_function(
  12486         "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
  12487     )

File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/generic.py:12442, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  12438 nv.validate_func(name, (), kwargs)
  12440 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 12442 return self._reduce(
  12443     func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  12444 )

File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/series.py:6478, in Series._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   6473     # GH#47500 - change to TypeError to match other methods
   6474     raise TypeError(
   6475         f"Series.{name} does not allow {kwd_name}={numeric_only} "
   6476         "with non-numeric dtypes."
   6477     )
-> 6478 return op(delegate, skipna=skipna, **kwds)

File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    145         result = alt(values, axis=axis, skipna=skipna, **kwds)
    146 else:
--> 147     result = alt(values, axis=axis, skipna=skipna, **kwds)
    149 return result

File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    401 if datetimelike and mask is None:
    402     mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    406 if datetimelike:
    407     result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)

File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/nanops.py:720, in nanmean(values, axis, skipna, mask)
    718 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
    719 the_sum = values.sum(axis, dtype=dtype_sum)
--> 720 the_sum = _ensure_numeric(the_sum)
    722 if axis is not None and getattr(the_sum, "ndim", False):
    723     count = cast(np.ndarray, count)

File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/nanops.py:1701, in _ensure_numeric(x)
   1698 elif not (is_float(x) or is_integer(x) or is_complex(x)):
   1699     if isinstance(x, str):
   1700         # GH#44008, GH#36703 avoid casting e.g. strings to numeric
-> 1701         raise TypeError(f"Could not convert string '{x}' to numeric")
   1702     try:
   1703         x = float(x)

TypeError: Could not convert string '8.68.1N/K8.38.08.18.08.48.28.08.08.08.68.68.08.28.48.48.18.38.48.28.58.08.28.18.48.18.68.48.18.78.1' to numeric

The problem#

The problem is that we were expecting our ratings to be numbers, but in fact, they are strings.

We can see what type of thing Pandas has stored by looking at the dtype attribute of a Series, or the dtypes attribute of a data frame.

films.dtypes

Votes     object
Rating    object
Title     object
Year       int64
Decade     int64
dtype: object

ratings.dtype

dtype('O')

In fact both these bits of information say the same thing – that the ‘Rating’ column stores things in the “object” or “O” type. This is a general type that can store any Python value. It is the standard type that Pandas uses when storing text.

Why does Pandas use text for the ‘Rating’ column?

A quick look at the first rows gives the answer:

ratings.head()

  8.6
  8.1
  N/K
  8.3
  8.0
Name: Rating, dtype: object

The film “Paris, Texas (1984)” has a value “N/K” for the rating. This can’t be a number, so Pandas stored this column in a format that allows it to store “N/K” as text.

If that wasn’t obvious, another way of checking where the problem value is, to apply the function float to the column values.

When we apply a function to a Series, it does this:

For each value in the Series it:
- Calls the function, with the value as the single argument.
- Collects the new value returned from the function, and appends it to a new Series.
Returns the new Series.

The result is a Series that is the same length as the original series, but where each value in the new series is the result of calling the function on the original value.

Recall that the float function converts the thing you pass into a floating point value:

v = float('3.14')
v

3.14

type(v)

float

Now we try applying float to the problematic column:

ratings.apply(float)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 ratings.apply(float)

File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/series.py:4943, in Series.apply(self, func, convert_dtype, args, by_row, **kwargs)
   4808 def apply(
   4809     self,
   4810     func: AggFuncType,
   (...)   4815     **kwargs,
   4816 ) -> DataFrame | Series:
   4817     """
   4818     Invoke function on values of Series.
   4819 
   (...)   4934     dtype: float64
   4935     """
   4936     return SeriesApply(
   4937         self,
   4938         func,
   4939         convert_dtype=convert_dtype,
   4940         by_row=by_row,
   4941         args=args,
   4942         kwargs=kwargs,
-> 4943     ).apply()

File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/apply.py:1422, in SeriesApply.apply(self)
   1419     return self.apply_compat()
   1421 # self.func is Callable
-> 1422 return self.apply_standard()

File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/apply.py:1502, in SeriesApply.apply_standard(self)
   1496 # row-wise access
   1497 # apply doesn't have a `na_action` keyword and for backward compat reasons
   1498 # we need to give `na_action="ignore"` for categorical data.
   1499 # TODO: remove the `na_action="ignore"` when that default has been changed in
   1500 #  Categorical (GH51645).
   1501 action = "ignore" if isinstance(obj.dtype, CategoricalDtype) else None
-> 1502 mapped = obj._map_values(
   1503     mapper=curried, na_action=action, convert=self.convert_dtype
   1504 )
   1506 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1507     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1508     #  See also GH#25959 regarding EA support
   1509     return obj._constructor_expanddim(list(mapped), index=obj.index)

File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/base.py:925, in IndexOpsMixin._map_values(self, mapper, na_action, convert)
    922 if isinstance(arr, ExtensionArray):
    923     return arr.map(mapper, na_action=na_action)
--> 925 return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)

File /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/pandas/core/algorithms.py:1743, in map_array(arr, mapper, na_action, convert)
   1741 values = arr.astype(object, copy=False)
   1742 if na_action is None:
-> 1743     return lib.map_infer(values, mapper, convert=convert)
   1744 else:
   1745     return lib.map_infer_mask(
   1746         values, mapper, mask=isna(values).view(np.uint8), convert=convert
   1747     )

File pandas/_libs/lib.pyx:2999, in pandas._libs.lib.map_infer()

ValueError: could not convert string to float: 'N/K'

One way of dealing with this problem is to make a recoding function.

A recoding function is a function that we will apply to a Series. That means that we call the function for every value in the Series. The function argument is the value from the series. The function returns the new value, for a new Series.

def recode_ratings(v):
    if v == 'N/K':  # Return missing value for 'N/K'
        return np.nan
    # Otherwise make text value into a float
    return float(v)

We test our function:

recode_ratings('8.3')

8.3

recode_ratings('N/K')

nan

We make a new Series by calling the recode function:

new_ratings = ratings.apply(recode_ratings)
new_ratings.head()

  8.6
  8.1
  NaN
  8.3
  8.0
Name: Rating, dtype: float64

We can insert this back into a copy of the original data frame:

films_fixed = films
films_fixed.loc[:, 'Rating'] = new_ratings
films_fixed.head()

	Votes	Rating	Title	Year	Decade
0	635139	8.6	Léon	1994	1990
1	264285	8.1	The Princess Bride	1987	1980
2	43090	NaN	Paris, Texas (1984)	1984	1980
3	90434	8.3	Rashômon	1950	1950
4	427099	8.0	X-Men: Days of Future Past	2014	2010

Numbers and strings

Contents

Numbers and strings#

An example#

The problem#