Time Travel: 2014

Chapter 202: The Cycle of Crazy Matryoshka Dolls

Eve Carly continued to state her views:

“...I think it is indeed a good idea to introduce vectors into the measurement of semantic text similarity.

But introducing vectors has also been like opening Pandora's box.

When vectors are used to process semantically complex text, extremely high-dimensional spaces form all too easily.

When that happens, the application scenario often becomes extremely difficult to work with.

The problem of dimensionality explosion arises again and again.

In fact, the problem of dimensionality explosion has already greatly restricted our research.

Dear Lin, what are your thoughts on this issue?”

Lin Hui said, “Dimensionality explosion is mainly a matter of high dimensions being hard to deal with.

In that case, why not consider reducing those high dimensions to lower ones?”

Lin Hui's tone was calm and unhurried.

As if he were describing something perfectly natural.

Dimensionality reduction? How do you reduce the dimensionality of something high-dimensional?

Hearing what the interpreter relayed, Eve Carly felt like she was about to vomit blood.

She was starting to want to learn Chinese.

She didn't know whether Lin Hui had literally meant nothing more than transforming high dimensions into low dimensions,

or whether he had actually specified what high-dimensional thing was to be converted into a low-dimensional one, with that detail omitted when the translation was relayed.

It would be really bad if some important term had been left out.

In the end, did Lin Hui mean converting high-dimensional data into low-dimensional data?

Or converting a high-dimensional model into a low-dimensional model?

Or did he mean something else entirely?

Eve Carly wanted to ask.

But considering how thoughtfully Lin Hui had treated Mina Kali earlier,

Eve Carly couldn't bring herself to make the interpreter Lin Hui had brought feel uneasy over something like this.

So she could only think carefully about what Lin Hui's words might mean.

First of all, Eve Carly felt that what Lin Hui wanted to say was not to reduce high-dimensional data to low-dimensional data.

If high-dimensional data appears while performing natural language processing,

dimensionality reduction can indeed be performed when analyzing that data.

In fact, dimensionality reduction must be carried out!

Although high-dimensional data models collect many data points,

the collected data is usually scattered across an extremely sparse and vast high-dimensional space.

In this case, many statistical methods are difficult to apply to high-dimensional data.

This is also one of the reasons the "curse of dimensionality" exists.

When the curse of dimensionality strikes, high-dimensional data is very difficult to handle without dimensionality reduction.

(ps:...people with a lot of talent in mathematics can also master high dimensions)
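For a concrete feel of what that scattering means, here is a minimal Python sketch (the point counts and dimensions are arbitrary illustrative choices, not anything from the text) showing how pairwise distances between random points lose their contrast as the dimension grows:

```python
# Curse-of-dimensionality illustration: as the dimension grows, the
# pairwise distances between random points concentrate, so "near" and
# "far" neighbors become almost indistinguishable.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    points = rng.random((200, dim))     # 200 random points in the unit cube
    dists = pdist(points)               # all unique pairwise distances
    contrast = (dists.max() - dists.min()) / dists.mean()
    print(f"dim={dim:5d}  relative spread of distances: {contrast:.2f}")
```

The shrinking spread is exactly why neighborhood-based statistical methods struggle once the data is spread across such a vast space.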

As a method of denoising and simplifying data, dimensionality reduction is helpful for processing most modern machine-learning data.

By reducing the dimensionality of the data, this complex and thorny problem can, in theory, be made simpler and more tractable.

So-called dimensionality reduction in the field of machine learning refers to using some mapping method

to map the data points of the original high-dimensional space into a low-dimensional space.

This is done to remove noise while preserving a low-dimensional representation of the information of interest.

This is very helpful for researchers to understand the structure and patterns hidden in the original high-dimensional data.

Raw high-dimensional data often contain observations of many irrelevant or redundant variables.

Dimensionality reduction can be regarded as a method of latent feature extraction.

Dimensionality reduction is a method often used for data compression, data exploration, and data visualization.
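As a rough illustration of that mapping idea, here is a minimal sketch using PCA; the chapter does not commit to any particular method, so the synthetic data, the choice of scikit-learn, and the target dimension below are all illustrative assumptions:

```python
# Dimensionality-reduction sketch: project noisy 50-dimensional points back
# down to 3 dimensions with PCA while keeping most of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Fake "high-dimensional" data: 200 observations that really live near a
# 3-D structure, embedded into 50 dimensions and lightly corrupted by noise.
latent = rng.normal(size=(200, 3))                  # hidden low-dim structure
embedding = rng.normal(size=(3, 50))                # random map into 50-D
X = latent @ embedding + 0.05 * rng.normal(size=(200, 50))

# Map the 50-D points into a 3-D space, removing noise while preserving
# a low-dimensional representation of the information of interest.
pca = PCA(n_components=3)
X_low = pca.fit_transform(X)

print(X.shape, "->", X_low.shape)                   # (200, 50) -> (200, 3)
print("variance retained:", round(pca.explained_variance_ratio_.sum(), 3))
```

Swapping in a different mapping (kernel PCA, random projections, an autoencoder, and so on) changes the details but not the basic "high-dimensional in, low-dimensional out" shape of the operation.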

That said, dimensionality reduction is not like casually throwing out a "two-way foil" the way science-fiction novels describe.

Dimensionality reduction is an extremely troublesome thing!

When choosing a dimensionality reduction method, one has to consider many factors.

The first thing to consider is the nature of the input data.

For example, continuous data, categorical data, count data, and distance data each call for different dimensionality-reduction methods.

Consideration of the nature and resolution of the data is very important.

If dimensionality reduction is carried out without regard for the nature of the input data, the high-dimensional model can certainly be forced into low dimensions,

but originally distinct data points are very likely to be "smeared" together.

That outcome is even worse than leaving the data scattered in high dimensions.

Before formally applying dimensionality-reduction techniques,

the high-dimensional data also has to be properly preprocessed.

After all, not all data arrives as clean, ready-to-use sample data.

And sometimes, when preprocessing is performed,

the best preprocessing method turns out to be dimensionality reduction itself.

This creates a crazy matryoshka cycle.

All in all, dimensionality reduction of high-dimensional data is a super troublesome thing.

In actual work, researchers in the field of natural language processing generally try their best to avoid dimensionality explosion in the first place,

rather than waiting for high-dimensional data to appear and only then reducing it to low dimensions.

To some extent, reducing high-dimensional data to low dimensions is more of a necessary but extremely troublesome remedial measure.

Anything that troublesome is enough to make people want to steer clear of it.

A complicated process also means it is error-prone.

The form of beautiful things should be simple.

Just like Euler's formula.

Because of this, Eve Carly felt that what a genius like Lin Hui wanted to express was definitely not to reduce high-dimensional data to low-dimensional data.

If what Lin Hui meant was not to operate on the high-dimensional data itself,

was he instead talking about reworking the traditional vector space model?

Converting a high-dimensional vector space model into a lower-dimensional one?

This idea is not bad.

But it was not as if no one had attempted this before.

Someone tried it a long time ago.

As early as the late last century, someone proposed the latent semantic analysis model.

The latent semantic analysis model was proposed on the basis of the vector space model (VSM).

The basic idea of the latent semantic analysis model is to first obtain a vector-space representation of the text,

then use singular value decomposition to map the high-dimensional, sparse space vectors into a low-dimensional latent semantic space.

After the low-dimensional text vectors and word vectors are obtained,

metrics such as cosine similarity are used to calculate the semantic similarity between texts.

The essential idea of latent semantic analysis is to remove noise in the original matrix through dimensionality reduction, thereby improving calculation accuracy.
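A minimal sketch of that pipeline might look like the following; the toy corpus, the use of scikit-learn, and the two-component truncated SVD are illustrative assumptions rather than anything the text specifies:

```python
# Latent semantic analysis sketch: build a sparse term-document matrix,
# compress it with truncated SVD, then compare documents by cosine similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on a rug",
    "stock markets fell sharply today",
    "shares dropped as markets tumbled",
]

# High-dimensional, sparse vector space model (one dimension per word).
vsm = CountVectorizer().fit_transform(docs)

# Map the sparse vectors into a low-dimensional latent semantic space via SVD.
svd = TruncatedSVD(n_components=2, random_state=0)
latent = svd.fit_transform(vsm)

# Semantic similarity between texts, measured in the latent space.
print(cosine_similarity(latent).round(2))
```

In a run like this, the two cat-related documents and the two market-related documents should come out mutually similar in the latent space, even where their surface word overlap is small.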

Although this idea is good, this method is not universal.

This is because the latent semantic analysis model uses singular value decomposition in the process of building the model.

This approach increases the computational complexity and makes it less portable.

After this method was proposed,

it was not as if no one tried to improve on it.

Also at the end of the last century,

some researchers proposed the probabilistic latent semantic analysis model.

This model is based on probability rather than singular value decomposition.

The main difference between this model and the latent semantic analysis model is the addition of a topic layer.

The expectation-maximization algorithm is then used to train the topics and obtain a probabilistic latent topic model,

which is used to predict the observed data in the text vector space.
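As a rough sketch of that training loop, the toy implementation below fits P(w|z) and P(z|d) with EM on a tiny word-count matrix; the counts, the number of topics, and the iteration budget are illustrative assumptions, and real implementations add smoothing and convergence checks:

```python
# Probabilistic latent semantic analysis sketch: alternate EM steps to fit
# the topic distributions P(z|d) and P(w|z) to an observed count matrix.
import numpy as np

rng = np.random.default_rng(0)

# n[d, w]: how often word w occurs in document d (4 documents, 6 words).
n = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 0, 1, 0, 0],
    [0, 0, 1, 0, 3, 2],
    [0, 1, 0, 0, 2, 3],
], dtype=float)

n_docs, n_words = n.shape
n_topics = 2  # the added "topic layer"

# Random initial distributions, each normalized along the correct axis.
p_z_d = rng.random((n_docs, n_topics))
p_z_d /= p_z_d.sum(axis=1, keepdims=True)            # P(z|d)
p_w_z = rng.random((n_topics, n_words))
p_w_z /= p_w_z.sum(axis=1, keepdims=True)             # P(w|z)

for _ in range(100):
    # E-step: responsibilities P(z|d,w) proportional to P(z|d) * P(w|z).
    joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]   # (docs, words, topics)
    resp = joint / joint.sum(axis=2, keepdims=True)

    # M-step: re-estimate both distributions from the weighted counts.
    weighted = n[:, :, None] * resp                   # n(d, w) * P(z|d,w)
    p_w_z = weighted.sum(axis=0).T
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = weighted.sum(axis=1)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

print("P(z|d):\n", p_z_d.round(2))
print("P(w|z):\n", p_w_z.round(2))
```

After training, each document's P(z|d) should concentrate on one of the two latent topics, mirroring the block structure of the count matrix, and the mixture sum over topics of P(z|d)P(w|z) can then be used to predict the observed data in the text vector space.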
