## How Does the TfidfVectorizer in scikit-learn "ValueError: np.nan is an invalid document" Error Occur?
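
The error is typically raised when the collection of documents contains a missing value (`np.nan`) instead of a string, for example a `Review` column with an empty cell. The snippet below is a minimal sketch that reproduces it; the sample reviews are made up for illustration, and only the column name `Review` comes from the solutions in this post.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data: the second review is missing, so it is stored as NaN (a float).
df = pd.DataFrame(
    {"Review": pd.Series(["great phone", np.nan, "battery died fast"], dtype=object)}
)

v = TfidfVectorizer()
try:
    v.fit_transform(df["Review"])
    error_message = None
except ValueError as exc:
    # The vectorizer rejects the NaN entry because it is not a string.
    error_message = str(exc)

print(error_message)
```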

## How To Solve the TfidfVectorizer in scikit-learn "ValueError: np.nan is an invalid document" Error?

## Solution 1

You need to convert the `object` dtype to a `unicode` string, as the traceback clearly states.

x = v.fit_transform(df['Review'].values.astype('U')) ## Even astype(str) would work

From the documentation page of `TfidfVectorizer`:

fit_transform(raw_documents, y=None)

Parameters: raw_documents : iterable

an iterable which yields either str, unicode or file objects
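
As a self-contained sketch of Solution 1, the example below applies `astype('U')` to a small, made-up `Review` column that contains one missing value. The data is an assumption for illustration; the conversion call is the one shown above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data with one missing review.
df = pd.DataFrame(
    {"Review": pd.Series(["great phone", np.nan, "battery died fast"], dtype=object)}
)

# Convert the object array to a NumPy unicode array; NaN becomes the string 'nan'.
docs = df["Review"].values.astype("U")

v = TfidfVectorizer()
x = v.fit_transform(docs)  # no ValueError: every document is now a string

print(docs.dtype.kind)  # 'U' for a unicode array
print(x.shape[0])       # one row per review
```

Note that the missing review is not dropped: it is vectorized as a document containing the word `nan`.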

## Solution 2

I found a more memory-efficient way to solve this problem.

x = v.fit_transform(df['Review'].apply(lambda x: np.str_(x)))

Of course, you can use `df['Review'].values.astype('U')` to convert the entire Series. But I found that this consumes much more memory if the Series you want to convert is really big. (I tested this on a Series with 800k rows of data, and `astype('U')` consumed about 96GB of memory.)

Instead, if you use the lambda expression to convert only the data in the Series from `str` to `numpy.str_`, the result is also accepted by the `fit_transform` function, and this is faster and does not increase the memory usage.
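
One likely reason for the memory difference is that `astype('U')` builds a fixed-width NumPy unicode array: every element is allocated as many characters as the longest document, at 4 bytes per character. The sketch below illustrates this with made-up data (many short reviews and one very long one); it does not reproduce the 800k-row measurement above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 999 short reviews and one review of 10,000 characters.
reviews = pd.Series(["ok"] * 999 + ["x" * 10_000], dtype=object)

# Fixed-width array: each of the 1,000 slots is sized for the longest review.
fixed_width = reviews.values.astype("U")
print(fixed_width.dtype.itemsize)  # 40000 bytes per element (10,000 chars * 4 bytes)
print(fixed_width.nbytes)          # 40000000 bytes in total

# Element-wise conversion keeps each string at its own length.
converted = reviews.apply(lambda x: np.str_(x))
total_chars = sum(len(s) for s in converted)
print(total_chars)                 # 11998 characters actually stored
```

With real review data, a single unusually long document is enough to inflate the fixed-width array in the same way.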

I’m not sure why this works, because the doc page of `TfidfVectorizer` says:

fit_transform(raw_documents, y=None)

Parameters: raw_documents : iterable

an iterable which yields either str, unicode or file objects

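The likely explanation is that the Series contains missing values. Plain `str` documents are accepted, but `np.nan` is a float rather than a string, and `np.str_(x)` converts it to the string `'nan'`.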
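
A quick check, sketched below with made-up documents, suggests that plain `str` values are accepted as they are, and that the conversion helps because it turns a missing value into the string `'nan'`. A side effect is that `nan` then appears as a token in the vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Plain Python strings are accepted without any conversion.
plain = ["great phone", "battery died fast"]
v_plain = TfidfVectorizer()
v_plain.fit_transform(plain)

# Hypothetical documents with one missing value, converted element by element.
raw = ["great phone", np.nan, "battery died fast"]
converted = [np.str_(doc) for doc in raw]

v = TfidfVectorizer()
v.fit_transform(converted)

# The missing value is now the literal word 'nan' in the vocabulary.
has_nan_token = "nan" in v.vocabulary_
print(converted[1], has_nan_token)
```

If the literal `nan` token is unwanted, one option is to replace missing values with an empty string before vectorizing, for example with `df['Review'].fillna('')`.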

**Summary**

That's all about this issue. Hope these solutions helped you. Comment below with your thoughts and queries, and let us know which solution worked for you.

